Roi Polanitzer's final project in the course "Data science, Machine Learning & Deep Learning with Python"

In [1]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
import plotly.offline as ply
import plotly.graph_objects as go
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
import statsmodels.api as sm
from sklearn import metrics
import warnings
from IPython.display import Image 
from sklearn.linear_model import LinearRegression
from sklearn import metrics
from itertools import combinations
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, f1_score, log_loss
from sklearn.metrics import confusion_matrix, classification_report
from pandas.plotting import scatter_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import MinMaxScaler
from sklearn.feature_selection import VarianceThreshold
from sklearn.ensemble import VotingClassifier
from sklearn.ensemble import RandomForestRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.model_selection import GridSearchCV
warnings.filterwarnings('ignore')
Using TensorFlow backend.
C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\framework\dtypes.py:526: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\framework\dtypes.py:527: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\framework\dtypes.py:528: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\framework\dtypes.py:529: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\framework\dtypes.py:530: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

C:\ProgramData\Anaconda3\lib\site-packages\tensorflow\python\framework\dtypes.py:535: FutureWarning:

Passing (type, 1) or '1type' as a synonym of type is deprecated; in a future version of numpy, it will be understood as (type, (1,)) / '(1,)type'.

Credit Decision Overview

Lending institutions all over the world need to classify whether a loan is acceptable (i.e., good) or not (i.e., default) to make their lending decisions. They can do this by relating the outcomes of those loans which were given (1= default, 0= good) to features such as home ownership, annual income, and the debt to income ratio of the credit applicant.

In [2]:
Image(filename='Lending-Club.jpg')
Out[2]:

Problem Statement

Classification- Predict if a loan given to credit applicant by LendingClub will default (Yes/No)

Gain insights on the parameters that affect the default of loan given to credit applicant by LendingClub according to various parameters and build a model that predicts if loan was defaulted

  • When the loan amount is higher -> there are more chances to get defaults?
  • When the loan term is longer -> there are more chances to get defaults?
  • When the interest rate is higher -> there are more chances to get defaults?
  • When the monthly payment is lower -> there are more chances to get defaults?
  • How does the applicant's loan purpose affect the loan? If it is for buying a house or for vacation there are more chances to get defaults?
  • When the applicant's employment length is lower -> there are more chances to get defaults?
  • Does the applicant rent or own his own home -> there are more chances to get defaults?
  • When the applicant's annual income is lower -> there are more chances to get defaults?
  • When the applicant's debt to income ratio is higher -> there are more chances to get defaults?
  • and more...

Or Maybe we don't have enough information to classify correctly if loan was defaulted and it depends on more feature like, lender character?

Classification work flow

  1. Reading the Datasets
  2. Data cleaning
  3. Feature Engineering
  4. Exploratory data analysis
  5. Creating dummy variables
  6. Handling imbalanced data
  7. Split data to Train & Test sets
  8. Sanity check between full data and selected data for prediction
  9. Training & Evaluating different machine learning classification models
  10. Comparing several machine learning classification models
  11. Training & Evaluating different ensemble learning classification methods
  12. Comparing several ensemble learning classification methods
  13. Training & Evaluating deep learning classification model
  14. Choosing the best model
  15. Predict+Evaluate on test set with chosen model
  16. Real time predictions
  17. Deploymnet

The Data

  • LendingClub_data.csv - contains loans records provided by the company Lending Club on its credit decisions.

Lending Club is a peer-to-peer lender that allows investors to lend money to borrowers witout intermediary being involved.

Data dictionary

  • LOAN_ID- The unique Lending Club assigned ID for the loan listing.
  • BORROWER_ID- The unique Lending Club assigned Id for the borrower member.
  • LOAN_AMOUNT- The listed amount of the loan applied for by the borrower.
  • LOAN_TERM- The number of payments on the loan (36 months or 60 months)
  • INTEREST_RATE- The Interest Rate on the loan.
  • MONTHLY_PAYMENT- The monthly payment owed by the borrower if the loan originates.
  • LOAN_PURPOSE- A category provided by the borrower for the loan request (debt consolidation, credit card, small business, medical, other, vacation, house, major purchase, home improvement, wedding, car, moving, renewable energy, and educational)
  • EMPLOYMENT LENGTH- Employment length in years
  • HOUSING- The home ownership status provided by the borrower during registration or obtained from the credit report. (Yes for Owner/Mortgage and No for Rent)
  • ANNUAL INCOME- The self-reported annual income provided by the borrower during registration.
  • DEBT TO INCOME- The ratio calculated using the borrower’s total monthly debt payments on the total debt obligations, excluding mortgage and the requested Lending Club loan, divided by the borrower’s self-reported monthly income.
  • DEFAULT- The current status of the loan (1 if Charged Off/Default 0 if Fully Paid)

1. Reading the Datasets

In [3]:
#reading loan df
df= pd.read_csv('Lending_Club.csv')
In [4]:
df.columns
Out[4]:
Index(['loan_id', 'borrower_id', 'loan_amount', 'loan_term', 'interest_rate',
       'monthly_payment', 'loan_purpose', 'employment_length', 'housing',
       'annual_income', 'debt_to_income', 'Default'],
      dtype='object')
In [5]:
df.describe().T
Out[5]:
count mean std min 25% 50% 75% max
loan_id 24999.0 8.261503e+05 4.788045e+05 6.00 408558.00 824416.00 1.243298e+06 1.646774e+06
borrower_id 24999.0 6.083746e+07 3.606363e+07 68400.00 31001076.50 65373097.00 9.124253e+07 6.089031e+08
loan_amount 24999.0 1.468700e+04 8.763621e+03 600.00 8000.00 12725.00 2.000000e+04 4.000000e+04
interest_rate 24999.0 1.321884e+01 4.741118e+00 5.32 9.76 12.74 1.599000e+01 3.099000e+01
monthly_payment 24999.0 4.367890e+02 2.560091e+02 21.59 251.58 377.62 5.770350e+02 1.501000e+03
annual_income 24999.0 7.706611e+04 5.497013e+04 0.00 46000.00 65000.00 9.200000e+04 1.500000e+06
debt_to_income 24999.0 1.860146e+01 1.404300e+01 0.00 12.05 17.81 2.427000e+01 9.990000e+02
Default 24999.0 1.022841e-01 3.030276e-01 0.00 0.00 0.00 0.000000e+00 1.000000e+00
In [6]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24999 entries, 0 to 24998
Data columns (total 12 columns):
loan_id              24999 non-null int64
borrower_id          24999 non-null int64
loan_amount          24999 non-null int64
loan_term            24999 non-null object
interest_rate        24999 non-null float64
monthly_payment      24999 non-null float64
loan_purpose         24999 non-null object
employment_length    23497 non-null object
housing              24999 non-null object
annual_income        24999 non-null float64
debt_to_income       24999 non-null float64
Default              24999 non-null int64
dtypes: float64(4), int64(4), object(4)
memory usage: 2.3+ MB

2. Data Cleaning

In [7]:
dfColumns=[col.strip().upper() for col in df.columns]
df.columns=dfColumns
print("Lending Club DF with NA values:")
print(df.columns[df.isna().any()].tolist())
Lending Club DF with NA values:
['EMPLOYMENT_LENGTH']
In [8]:
df.columns
Out[8]:
Index(['LOAN_ID', 'BORROWER_ID', 'LOAN_AMOUNT', 'LOAN_TERM', 'INTEREST_RATE',
       'MONTHLY_PAYMENT', 'LOAN_PURPOSE', 'EMPLOYMENT_LENGTH', 'HOUSING',
       'ANNUAL_INCOME', 'DEBT_TO_INCOME', 'DEFAULT'],
      dtype='object')
In [9]:
print("Counting NA values per recognized columns with NA:")
print("EMPLOYMENT_LENGTH NA Valus:"+ str(df.EMPLOYMENT_LENGTH.isna().sum()))
Counting NA values per recognized columns with NA:
EMPLOYMENT_LENGTH NA Valus:1502
In [10]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 24999 entries, 0 to 24998
Data columns (total 12 columns):
LOAN_ID              24999 non-null int64
BORROWER_ID          24999 non-null int64
LOAN_AMOUNT          24999 non-null int64
LOAN_TERM            24999 non-null object
INTEREST_RATE        24999 non-null float64
MONTHLY_PAYMENT      24999 non-null float64
LOAN_PURPOSE         24999 non-null object
EMPLOYMENT_LENGTH    23497 non-null object
HOUSING              24999 non-null object
ANNUAL_INCOME        24999 non-null float64
DEBT_TO_INCOME       24999 non-null float64
DEFAULT              24999 non-null int64
dtypes: float64(4), int64(4), object(4)
memory usage: 2.3+ MB
In [11]:
# filling recordes with A values
df_loan = df.copy()
df_loan.fillna(method='ffill', inplace=True)
# In Addition removing unnecessary column
df_loan=df_loan.drop(["LOAN_ID","BORROWER_ID"], axis=1)
print("Current columns with NA values")
print(df_loan.columns[df_loan.isna().any()].tolist())
#Removing records where is NA
df_loan=df_loan.dropna()
Current columns with NA values
[]
In [12]:
df_loan.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 24999 entries, 0 to 24998
Data columns (total 10 columns):
LOAN_AMOUNT          24999 non-null int64
LOAN_TERM            24999 non-null object
INTEREST_RATE        24999 non-null float64
MONTHLY_PAYMENT      24999 non-null float64
LOAN_PURPOSE         24999 non-null object
EMPLOYMENT_LENGTH    24999 non-null object
HOUSING              24999 non-null object
ANNUAL_INCOME        24999 non-null float64
DEBT_TO_INCOME       24999 non-null float64
DEFAULT              24999 non-null int64
dtypes: float64(4), int64(2), object(4)
memory usage: 2.1+ MB
In [13]:
df_loan.describe().style.apply(lambda x: ["background: yellow" if v <= 0  else "" for v in x], axis = 1)
Out[13]:
LOAN_AMOUNT INTEREST_RATE MONTHLY_PAYMENT ANNUAL_INCOME DEBT_TO_INCOME DEFAULT
count 24999 24999 24999 24999 24999 24999
mean 14687 13.2188 436.789 77066.1 18.6015 0.102284
std 8763.62 4.74112 256.009 54970.1 14.043 0.303028
min 600 5.32 21.59 0 0 0
25% 8000 9.76 251.58 46000 12.05 0
50% 12725 12.74 377.62 65000 17.81 0
75% 20000 15.99 577.035 92000 24.27 0
max 40000 30.99 1501 1.5e+06 999 1
In [14]:
print("LOAN_AMOUNT unique values:"+str(df_loan.LOAN_AMOUNT.unique()))
print("LOAN_TERM unique values:"+str(df_loan.LOAN_TERM.unique()))
print("INTEREST_RATE unique values:"+str(df_loan.INTEREST_RATE.unique()))
print("MONTHLY_PAYMENT unique values:"+str(df_loan.MONTHLY_PAYMENT.unique()))
print("LOAN_PURPOSE unique values:"+str(df_loan.LOAN_PURPOSE.unique()))
print("EMPLOYMENT_LENGTH unique values:"+str(df_loan.EMPLOYMENT_LENGTH.unique()))
print("HOUSING unique values:"+str(df_loan.HOUSING.unique()))
print("ANNUAL_INCOME unique values:"+str(df_loan.ANNUAL_INCOME.unique()))
print("DEBT_TO_INCOME unique values:"+str(df_loan.DEBT_TO_INCOME.unique()))
print("DEFAULT unique values:"+str(df_loan.DEFAULT.unique()))
print("LOAN_AMOUNT values count:")
print(df_loan.LOAN_AMOUNT.value_counts())
print("LOAN_TERM values count:")
print(df_loan.LOAN_TERM.value_counts())
print("LOAN_PURPOSE values count:")
print(df_loan.LOAN_PURPOSE.value_counts())
print("EMPLOYMENT_LENGTH values count:")
print(df_loan.EMPLOYMENT_LENGTH.value_counts())
print("HOUSING values count:")
print(df_loan.HOUSING.value_counts())
print("ANNUAL_INCOME values count:")
print(df_loan.ANNUAL_INCOME.value_counts())
print("DEBT_TO_INCOME values count:")
print(df_loan.DEBT_TO_INCOME.value_counts())
print("DEFAULT values count:")
print(df_loan.DEFAULT.value_counts())
LOAN_AMOUNT unique values:[20000 30000 21500 ... 34700 20575 25300]
LOAN_TERM unique values:[' 60 months' ' 36 months']
INTEREST_RATE unique values:[17.93 11.99 13.67  8.49 30.74 14.08 12.39 12.79 10.16  9.99 11.67  9.93
 15.31 18.99 18.55  7.35 19.19  8.39  7.07 12.74 13.66 19.99 10.99  9.17
 13.99 13.49 14.46 11.39 12.99 11.44 10.75 18.25 10.49 26.24 15.61  6.62
 14.49 16.99 15.41 16.29 14.65  8.24 28.34 13.53 15.99 25.49 11.14 22.91
  8.99 16.55 20.99 14.99 24.49  7.29 13.98 11.49  7.9  12.69 17.77 16.32
 22.35 14.85 22.4  15.59 12.29 25.57 17.99 10.42  9.16  6.92  9.75 15.05
  9.49 20.49 13.33  7.89 20.    7.49 11.53 12.62 13.18 14.31 10.64 21.
 14.33  8.59 12.12  6.97  7.26  7.51  8.67 11.48 10.15  7.97 14.98 16.02
 17.86  7.62 17.57 24.85  5.32 16.59 10.91 13.59  6.24  7.24 13.44  6.99
 25.69 12.61  7.12 19.03 13.11 13.65 24.99  6.03 23.99 19.24 25.88 21.6
  8.19 22.74  7.59 15.8  23.88 10.74 18.06 12.59 25.8  12.49  7.99 20.2
 16.4  26.06 17.27 25.28 13.23  5.42 13.92 20.31 12.88 17.97  9.71 21.99
  6.49  6.   12.05 23.1   5.99 21.48 14.3   9.44 21.45 25.83 19.52 30.94
 13.35 20.5   6.89  7.91  8.38 17.09 14.61 25.99 20.77  9.67 11.55 19.47
 10.08 30.75 18.2   7.39 23.43 16.78  8.9  14.47 12.21 18.24 23.13  7.66
 21.18  7.21 14.16 27.31  8.18 11.47 17.56 14.27 20.75 15.1  15.88 27.49
  5.93 15.77  9.91 14.64 29.49  8.6  21.67 16.77  6.91 17.58 21.98  6.68
 12.35 10.37 13.06 19.05 22.45 22.99 18.49 30.17 21.49  7.69 17.14 12.85
 24.5  22.2  19.2  19.97  7.88 18.84 18.67 15.7  16.49 22.15 11.58 11.71
  9.8  21.15 22.7  28.88 26.77 18.62 22.39 24.74 18.3  15.27 13.68 24.11
 14.09 23.32 19.53 28.72  7.14 21.97 17.76 10.78 24.08 12.73  6.39 25.11
 15.21 18.75 14.12  9.76 30.49  8.88 10.   16.2  14.48 11.22 10.65 18.54
 23.76 29.69 23.4  13.05 18.92 18.85 19.91 19.69 14.83 26.3  23.26 21.7
 17.88 25.29 24.33 23.28 15.76 25.89 16.24 15.81 19.72  9.62 28.99 30.99
 25.65 19.22 28.67 11.11 15.22 11.86 23.91 11.83 26.57 17.1  26.49  9.63
 25.78 12.23 28.14 15.95 22.95 28.69  6.76 19.29  7.68 24.89 19.89 27.79
 14.79  5.79 25.82  8.94 15.58 23.5  11.36  6.17 29.96 30.89 18.39  9.25
 26.99 12.42  6.54 30.79 22.9  23.63 13.16 22.11 29.99 18.64 14.26 11.26
 14.22 12.84 11.89  8.   11.28 16.89 12.68 22.47 23.7   9.32 13.85 10.62
 10.2  13.61 27.34 13.72 27.88 12.53 20.3  10.36 16.82 14.59 24.83 15.65
 30.65 14.96 19.48 14.91 20.48 15.96 13.48 16.45  9.33 18.17 14.72 17.03
 26.14 12.18 18.79 23.83 11.03 13.84 16.95 18.09 10.01 10.59 24.7  28.18
  9.07 25.44 20.8   7.4  17.19 13.22  7.74 15.23  9.83 29.67 15.33 20.25
 15.28 13.57 25.09 20.89  9.96 10.39 15.57 30.84 14.17 10.25 19.42 13.8
 20.53 11.34 16.63 11.97 11.66 14.18 18.53 15.07 20.11 11.12 16.   27.99
 15.13 17.66 14.75 16.35 13.75 10.28 17.04  9.88 13.43 22.85 12.87 10.38
 13.47 12.98 13.55 15.68 15.2  15.37 12.54 14.74 28.49 17.49 16.69 10.96]
MONTHLY_PAYMENT unique values:[342.94 996.29 714.01 ... 140.24 218.99 280.66]
LOAN_PURPOSE unique values:['debt_consolidation' 'credit_card' 'small_business' 'medical' 'other'
 'vacation' 'house' 'major_purchase' 'home_improvement' 'wedding' 'car'
 'moving' 'renewable_energy' 'educational']
EMPLOYMENT_LENGTH unique values:['1 year' '10+ years' '8 years' '9 years' '5 years' '3 years' '4 years'
 '< 1 year' '7 years' '2 years' '6 years']
HOUSING unique values:['yes' 'no']
ANNUAL_INCOME unique values:[ 44304.   136000.    50000.   ...  42494.    14988.    74904.64]
DEBT_TO_INCOME unique values:[18.47 20.63 29.62 ... 36.91 36.6  68.21]
DEFAULT unique values:[1 0]
LOAN_AMOUNT values count:
10000    1762
12000    1391
20000    1369
15000    1329
35000     962
         ... 
31350       1
21075       1
35400       1
27150       1
32800       1
Name: LOAN_AMOUNT, Length: 1186, dtype: int64
LOAN_TERM values count:
 36 months    17865
 60 months     7134
Name: LOAN_TERM, dtype: int64
LOAN_PURPOSE values count:
debt_consolidation    14555
credit_card            5470
home_improvement       1613
other                  1460
major_purchase          530
small_business          290
medical                 285
car                     273
vacation                203
moving                  172
house                   104
wedding                  32
renewable_energy         11
educational               1
Name: LOAN_PURPOSE, dtype: int64
EMPLOYMENT_LENGTH values count:
10+ years    8907
2 years      2392
3 years      2100
< 1 year     2072
1 year       1694
4 years      1613
5 years      1566
6 years      1316
8 years      1147
7 years      1145
9 years      1047
Name: EMPLOYMENT_LENGTH, dtype: int64
HOUSING values count:
yes    15031
no      9968
Name: HOUSING, dtype: int64
ANNUAL_INCOME values count:
60000.0     951
50000.0     839
65000.0     722
70000.0     715
40000.0     648
           ... 
36120.0       1
31668.0       1
416000.0      1
40920.0       1
32254.0       1
Name: ANNUAL_INCOME, Length: 3096, dtype: int64
DEBT_TO_INCOME values count:
0.00     27
20.60    23
19.20    21
14.42    21
17.23    21
         ..
1.53      1
38.95     1
2.45      1
53.79     1
50.73     1
Name: DEBT_TO_INCOME, Length: 3853, dtype: int64
DEFAULT values count:
0    22442
1     2557
Name: DEFAULT, dtype: int64

3. Feature Engineering

In [15]:
for i in range (0, len(df_loan['EMPLOYMENT_LENGTH'])):
    if df_loan['EMPLOYMENT_LENGTH'][i] == "< 1 year": 
        df_loan['EMPLOYMENT_LENGTH'][i]= "< 1 year"
    elif df_loan['EMPLOYMENT_LENGTH'][i] == "1 year" or df_loan['EMPLOYMENT_LENGTH'][i] == "2 years": 
        df_loan['EMPLOYMENT_LENGTH'][i]= "1-2 Years"
    elif df_loan['EMPLOYMENT_LENGTH'][i] == "3 years" or df_loan['EMPLOYMENT_LENGTH'][i] == "4 years": 
        df_loan['EMPLOYMENT_LENGTH'][i]= "3-4 Years"
    elif df_loan['EMPLOYMENT_LENGTH'][i] == "5 years" or df_loan['EMPLOYMENT_LENGTH'][i] == "6 years":  
        df_loan['EMPLOYMENT_LENGTH'][i]= "5-6 Years"
    elif df_loan['EMPLOYMENT_LENGTH'][i] == "7 years" or df_loan['EMPLOYMENT_LENGTH'][i] == "8 years":
        df_loan['EMPLOYMENT_LENGTH'][i]= "7-8 Years"
    elif df_loan['EMPLOYMENT_LENGTH'][i] == "9 years" or df_loan['EMPLOYMENT_LENGTH'][i] == "10 years":
        df_loan['EMPLOYMENT_LENGTH'][i]= "9-10 Years"
    else: 
        df_loan['EMPLOYMENT_LENGTH'][i]= ">10 Years"
In [16]:
print("EMPLOYMENT_LENGTH :", df_loan['EMPLOYMENT_LENGTH'].unique())
EMPLOYMENT_LENGTH : ['1-2 Years' '>10 Years' '7-8 Years' '9-10 Years' '5-6 Years' '3-4 Years'
 '< 1 year']

4. Exploratory data analysis

In [17]:
df_loan.describe().T
Out[17]:
count mean std min 25% 50% 75% max
LOAN_AMOUNT 24999.0 14687.002480 8763.621362 600.00 8000.00 12725.00 20000.000 40000.00
INTEREST_RATE 24999.0 13.218843 4.741118 5.32 9.76 12.74 15.990 30.99
MONTHLY_PAYMENT 24999.0 436.788993 256.009134 21.59 251.58 377.62 577.035 1501.00
ANNUAL_INCOME 24999.0 77066.106339 54970.127344 0.00 46000.00 65000.00 92000.000 1500000.00
DEBT_TO_INCOME 24999.0 18.601464 14.042998 0.00 12.05 17.81 24.270 999.00
DEFAULT 24999.0 0.102284 0.303028 0.00 0.00 0.00 0.000 1.00
In [18]:
left = df_loan.groupby('DEFAULT')
left.mean()
Out[18]:
LOAN_AMOUNT INTEREST_RATE MONTHLY_PAYMENT ANNUAL_INCOME DEBT_TO_INCOME
DEFAULT
0 14627.105427 12.934007 434.614986 77888.128034 18.406164
1 15212.700430 15.718756 455.869582 69851.475569 20.315549
In [19]:
sns.countplot(x='DEFAULT', data=df_loan)
plt.xlabel('Loan Outcome')
plt.ylabel("Number of Loans")
plt.title("Loan Classification")
plt.show()
In [20]:
pd.value_counts(df_loan['DEFAULT'].values,normalize=True)
Out[20]:
0    0.897716
1    0.102284
dtype: float64
In [21]:
df_loan['DEFAULT'].value_counts()
Out[21]:
0    22442
1     2557
Name: DEFAULT, dtype: int64
In [22]:
df_plot = df_loan['DEFAULT'].value_counts().reset_index()
plot_data = [
    go.Bar(
        x=df_plot['index'],
        y=df_plot['DEFAULT'],
        width = [0.5, 0.5],
        marker=dict(
        color=['blue', 'orange'])
    )
]
plot_layout = go.Layout(
        xaxis={"type": "category"},
        yaxis={"title": "DEFAULT"},
        title='DEFAULT',
        plot_bgcolor  = 'rgb(243,243,243)',
        paper_bgcolor  = 'rgb(243,243,243)',
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
ply.iplot(fig)

There are 22442 no default loans and 2557 default loans in the outcome variables.

In [23]:
count_no_default = len(df_loan[df_loan['DEFAULT']==0])
count_default = len(df_loan[df_loan['DEFAULT']==1])
pct_of_no_default = count_no_default/(count_no_default+count_default)
print("percentage of no default is", pct_of_no_default*100)
pct_of_default = count_default/(count_no_default+count_default)
print("percentage of default", pct_of_default*100)
percentage of no default is 89.77159086363454
percentage of default 10.228409136365455
In [24]:
%matplotlib inline
pd.crosstab(df_loan.HOUSING,df_loan.DEFAULT).plot(kind='bar')
plt.title('Default Frequency for Home Ownership')
plt.xlabel('Home Ownership')
plt.ylabel('Number of Loans')
plt.savefig('default_fre_home')
In [25]:
table=pd.crosstab(df_loan['EMPLOYMENT_LENGTH'],df_loan['DEFAULT'])
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of EMPLOYMENT_LENGTH vs Default')
plt.xlabel('EMPLOYMENT_LENGTH')
plt.ylabel('Proportion of Customers')
plt.savefig('mariral_vs_pur_stack')
In [26]:
%matplotlib inline
import matplotlib
matplotlib.rcParams['figure.figsize'] = (10.0, 10.0)
import matplotlib.pyplot as plt
import seaborn as sns
sns.kdeplot(df_loan['ANNUAL_INCOME'].loc[df_loan['DEFAULT'] == 0], label='not default', shade=True);
sns.kdeplot(df_loan['ANNUAL_INCOME'].loc[df_loan['DEFAULT'] == 1], label='not default', shade=True);
In [27]:
num_projects=df_loan.groupby('EMPLOYMENT_LENGTH').count()
plt.barh(num_projects.index.values, num_projects['LOAN_AMOUNT'], color=['blue', 'orange','green','red','purple','chocolate','pink','grey'])
plt.xlabel('Number of Loans')
plt.ylabel("borrower's Employment Length")
plt.title('Loans Frequency for Employment Length')
plt.show()
In [28]:
num_projects=df_loan.groupby('LOAN_PURPOSE').count()
plt.barh(num_projects.index.values, num_projects['LOAN_AMOUNT'], color=['blue', 'orange','green','red','purple','chocolate','pink','grey'])
plt.xlabel('Number of Loans')
plt.ylabel("borrower's Employment Length")
plt.title('Loans Frequency for Loan purpose')
plt.show()
In [29]:
%matplotlib inline
import matplotlib.pyplot as plt
df_loan.hist(bins=10, figsize=(20,15))
plt.savefig("attribute_histogram_plots")
plt.show()
In [30]:
features2=['LOAN_PURPOSE', 'EMPLOYMENT_LENGTH', 'HOUSING', 'DEFAULT']
fig=plt.subplots(figsize=(30,30))
for i, j in enumerate(features2):
    plt.subplot(4, 2, i+1)
    plt.subplots_adjust(hspace = 1.0)
    sns.countplot(x=j,data = df_loan)
    plt.xticks(rotation=90)
    plt.title("Number of Loans")
In [31]:
fig=plt.subplots(figsize=(30,30))
for i, j in enumerate(features2):
    plt.subplot(4, 2, i+1)
    plt.subplots_adjust(hspace = 1.0)
    sns.countplot(x=j,data = df_loan, hue='DEFAULT')
    plt.xticks(rotation=90)
    plt.title("Number of Loans")
In [32]:
df_plot = df_loan.groupby('LOAN_AMOUNT').DEFAULT.mean().reset_index()
plot_data = [
    go.Scatter(
        x=df_plot['LOAN_AMOUNT'],
        y=df_plot['DEFAULT'],
        mode='markers',
        name='Low',
        marker= dict(size= 7,
            line= dict(width=1),
            color= 'blue',
            opacity= 0.8
           ),
    )
]
plot_layout = go.Layout(
        yaxis= {'title': "DEFAULT"},
        xaxis= {'title': "LOAN_AMOUNT"},
        title='LOAN_AMOUNT vs DEFAULT',
        plot_bgcolor  = "rgb(243,243,243)",
        paper_bgcolor  = "rgb(243,243,243)",
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
ply.iplot(fig)
In [33]:
df_loan.LOAN_AMOUNT.hist(bins=50, figsize=(20,15), color=['green'])
plt.savefig("LOAN_AMOUNT_histogram_plot")
plt.show()
In [34]:
#mean DEFAULT by LOAN_TERM
ax=df_loan.groupby(['LOAN_TERM']).mean().reindex([' 60 months', ' 36 months'])['DEFAULT'].plot.bar(figsize=(12, 5),color=['red','purple'])
ax.set_title('DEFAULT mean LOAN_TERM')
ax.set_ylabel('DEFAULT mean')
Out[34]:
Text(0, 0.5, 'DEFAULT mean')
In [35]:
df_plot = df_loan.groupby('INTEREST_RATE').DEFAULT.mean().reset_index()
plot_data = [
    go.Scatter(
        x=df_plot['INTEREST_RATE'],
        y=df_plot['DEFAULT'],
        mode='markers',
        name='Low',
        marker= dict(size= 7,
            line= dict(width=1),
            color= 'blue',
            opacity= 0.8
           ),
    )
]
plot_layout = go.Layout(
        yaxis= {'title': "DEFAULT"},
        xaxis= {'title': "INTEREST_RATE"},
        title='INTEREST_RATE vs DEFAULT',
        plot_bgcolor  = "rgb(243,243,243)",
        paper_bgcolor  = "rgb(243,243,243)",
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
ply.iplot(fig)
In [36]:
df_loan.INTEREST_RATE.hist(bins=50, figsize=(20,15),color=['chocolate'])
plt.savefig("INTEREST_RATE_histogram_plot")
plt.show()
In [37]:
df_plot = df_loan.groupby('MONTHLY_PAYMENT').DEFAULT.mean().reset_index()
plot_data = [
    go.Scatter(
        x=df_plot['MONTHLY_PAYMENT'],
        y=df_plot['DEFAULT'],
        mode='markers',
        name='Low',
        marker= dict(size= 7,
            line= dict(width=1),
            color= 'blue',
            opacity= 0.8
           ),
    )
]
plot_layout = go.Layout(
        yaxis= {'title': "DEFAULT"},
        xaxis= {'title': "MONTHLY_PAYMENT"},
        title='MONTHLY_PAYMENT vs DEFAULT',
        plot_bgcolor  = "rgb(243,243,243)",
        paper_bgcolor  = "rgb(243,243,243)",
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
ply.iplot(fig)
In [38]:
df_loan.MONTHLY_PAYMENT.hist(bins=50, figsize=(20,15),color=['pink'])
plt.savefig("MONTHLY_PAYMENT_histogram_plot")
plt.show()
In [39]:
#mean DEFAULT by LOAN_PURPOSE
ax=df_loan.groupby(['LOAN_PURPOSE']).mean().reindex(['debt_consolidation', 'credit_card',\
                                                     'small_business', 'medical', 'other',\
                                                     'vacation', 'house', 'major_purchase',\
                                                     'home_improvement', 'wedding', 'car',\
                                                     'moving', 'renewable_energy', 'educational'])['DEFAULT'].plot.bar(figsize=(12, 5),color=['blue', 'orange','green','red','purple','chocolate','pink','grey','black','brown','yellow'])
ax.set_title('DEFAULT mean LOAN_PURPOSE')
ax.set_ylabel('DEFAULT mean')
Out[39]:
Text(0, 0.5, 'DEFAULT mean')
In [40]:
#mean DEFAULT by EMPLOYMENT_LENGTH
ax=df_loan.groupby(['EMPLOYMENT_LENGTH']).mean().reindex(['< 1 year','1-2 Years','3-4 Years',\
                                                          '5-6 Years','7-8 Years','9-10 Years',\
                                                          '>10 Years' 
 ])['DEFAULT'].plot.bar(figsize=(12, 5),color=['blue', 'orange','green','red','purple','chocolate','pink','grey','black','brown','yellow'])
ax.set_title('DEFAULT mean EMPLOYMENT_LENGTH')
ax.set_ylabel('DEFAULT mean')
Out[40]:
Text(0, 0.5, 'DEFAULT mean')
In [41]:
#mean DEFAULT by HOUSING
ax=df_loan.groupby(['HOUSING']).mean().reindex(['yes', 'no'])['DEFAULT'].plot.bar(figsize=(12, 5),color=['green','red'])
ax.set_title('DEFAULT mean HOUSING')
ax.set_ylabel('DEFAULT mean')
Out[41]:
Text(0, 0.5, 'DEFAULT mean')
In [42]:
df_plot = df_loan.groupby('ANNUAL_INCOME').DEFAULT.mean().reset_index()
plot_data = [
    go.Scatter(
        x=df_plot['ANNUAL_INCOME'],
        y=df_plot['DEFAULT'],
        mode='markers',
        name='Low',
        marker= dict(size= 7,
            line= dict(width=1),
            color= 'blue',
            opacity= 0.8
           ),
    )
]
plot_layout = go.Layout(
        yaxis= {'title': "DEFAULT"},
        xaxis= {'title': "ANNUAL_INCOME"},
        title='ANNUAL_INCOME vs DEFAULT',
        plot_bgcolor  = "rgb(243,243,243)",
        paper_bgcolor  = "rgb(243,243,243)",
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
ply.iplot(fig)
In [43]:
df_loan.ANNUAL_INCOME.hist(bins=50, figsize=(20,15),color=['grey'])
plt.savefig("ANNUAL_INCOME_histogram_plot")
plt.show()
In [44]:
df_plot = df_loan.groupby('DEBT_TO_INCOME').DEFAULT.mean().reset_index()
plot_data = [
    go.Scatter(
        x=df_plot['DEBT_TO_INCOME'],
        y=df_plot['DEFAULT'],
        mode='markers',
        name='Low',
        marker= dict(size= 7,
            line= dict(width=1),
            color= 'blue',
            opacity= 0.8
           ),
    )
]
plot_layout = go.Layout(
        yaxis= {'title': "DEFAULT"},
        xaxis= {'title': "DEBT_TO_INCOME"},
        title='DEBT_TO_INCOME vs DEFAULT',
        plot_bgcolor  = "rgb(243,243,243)",
        paper_bgcolor  = "rgb(243,243,243)",
    )
fig = go.Figure(data=plot_data, layout=plot_layout)
ply.iplot(fig)
In [45]:
df_loan.DEBT_TO_INCOME.hist(bins=50, figsize=(20,15),color=['yellow'])
plt.savefig("DEBT_TO_INCOME_histogram_plot")
plt.show()
In [46]:
#move down
sns.pairplot(x_vars=['MONTHLY_PAYMENT'], y_vars=['LOAN_AMOUNT'], data=df_loan, hue ="LOAN_TERM", size=5)
plt.title("MONTHLY_PAYMENT vs LOAN_AMOUNT by EMPLOYMENT_LENGTH")
Out[46]:
Text(0.5, 1, 'MONTHLY_PAYMENT vs LOAN_AMOUNT by EMPLOYMENT_LENGTH')
In [47]:
features=['LOAN_AMOUNT', 'LOAN_TERM', 'INTEREST_RATE',
       'MONTHLY_PAYMENT', 'LOAN_PURPOSE', 'EMPLOYMENT_LENGTH', 'HOUSING',
       'ANNUAL_INCOME', 'DEBT_TO_INCOME', 'DEFAULT']
sns.set_style()
corr = df_loan[features].corr()
sns.heatmap(corr,cmap="RdYlBu",vmin=-1,vmax=1)
plt.title("correlation heat map")
plt.show()
In [48]:
features=['LOAN_AMOUNT', 'LOAN_TERM', 'INTEREST_RATE',
       'MONTHLY_PAYMENT', 'LOAN_PURPOSE', 'EMPLOYMENT_LENGTH', 'HOUSING',
       'ANNUAL_INCOME', 'DEBT_TO_INCOME', 'DEFAULT']
mask = np.zeros_like(df_loan[features].corr(), dtype=np.bool) 
mask[np.triu_indices_from(mask)] = True 

f, ax = plt.subplots(figsize=(16, 12))
plt.title('Correlation Matrix',fontsize=25)

sns.heatmap(df_loan[features].corr(),vmax=1.0,square=True,cmap="RdYlBu", 
            linecolor='w',annot=True,mask=mask,cbar_kws={"shrink": .75})
Out[48]:
<matplotlib.axes._subplots.AxesSubplot at 0x1344648ad08>

Now let’s look at how much each independent variable correlates with the DEFAULT.

In [49]:
corr_matrix = df_loan.corr()
corr_matrix["DEFAULT"].sort_values(ascending=False)
Out[49]:
DEFAULT            1.000000
INTEREST_RATE      0.177987
DEBT_TO_INCOME     0.041202
MONTHLY_PAYMENT    0.025158
LOAN_AMOUNT        0.020249
ANNUAL_INCOME     -0.044303
Name: DEFAULT, dtype: float64
In [50]:
attributes = ["DEFAULT", "INTEREST_RATE", "MONTHLY_PAYMENT",'LOAN_AMOUNT','DEBT_TO_INCOME']
scatter_matrix(df_loan[attributes], figsize=(12, 8))
plt.savefig('matrix.png')

5. Creating dummy variables

Create dummy variables for four categorical variables.

In [51]:
df_loan_dummies=pd.get_dummies(df_loan,columns=['LOAN_TERM','LOAN_PURPOSE','EMPLOYMENT_LENGTH', 'HOUSING'])
df_loan_dummies.head().T 
Out[51]:
0 1 2 3 4
LOAN_AMOUNT 20000.00 30000.00 21500.00 10000.00 5000.00
INTEREST_RATE 17.93 11.99 11.99 13.67 8.49
MONTHLY_PAYMENT 342.94 996.29 714.01 340.18 157.82
ANNUAL_INCOME 44304.00 136000.00 50000.00 64400.00 88000.00
DEBT_TO_INCOME 18.47 20.63 29.62 16.68 5.32
DEFAULT 1.00 0.00 0.00 0.00 0.00
LOAN_TERM_ 36 months 0.00 1.00 1.00 1.00 1.00
LOAN_TERM_ 60 months 1.00 0.00 0.00 0.00 0.00
LOAN_PURPOSE_car 0.00 0.00 0.00 0.00 0.00
LOAN_PURPOSE_credit_card 0.00 0.00 0.00 0.00 0.00
LOAN_PURPOSE_debt_consolidation 1.00 1.00 1.00 1.00 1.00
LOAN_PURPOSE_educational 0.00 0.00 0.00 0.00 0.00
LOAN_PURPOSE_home_improvement 0.00 0.00 0.00 0.00 0.00
LOAN_PURPOSE_house 0.00 0.00 0.00 0.00 0.00
LOAN_PURPOSE_major_purchase 0.00 0.00 0.00 0.00 0.00
LOAN_PURPOSE_medical 0.00 0.00 0.00 0.00 0.00
LOAN_PURPOSE_moving 0.00 0.00 0.00 0.00 0.00
LOAN_PURPOSE_other 0.00 0.00 0.00 0.00 0.00
LOAN_PURPOSE_renewable_energy 0.00 0.00 0.00 0.00 0.00
LOAN_PURPOSE_small_business 0.00 0.00 0.00 0.00 0.00
LOAN_PURPOSE_vacation 0.00 0.00 0.00 0.00 0.00
LOAN_PURPOSE_wedding 0.00 0.00 0.00 0.00 0.00
EMPLOYMENT_LENGTH_1-2 Years 1.00 0.00 1.00 1.00 0.00
EMPLOYMENT_LENGTH_3-4 Years 0.00 0.00 0.00 0.00 0.00
EMPLOYMENT_LENGTH_5-6 Years 0.00 0.00 0.00 0.00 0.00
EMPLOYMENT_LENGTH_7-8 Years 0.00 0.00 0.00 0.00 0.00
EMPLOYMENT_LENGTH_9-10 Years 0.00 0.00 0.00 0.00 0.00
EMPLOYMENT_LENGTH_< 1 year 0.00 0.00 0.00 0.00 0.00
EMPLOYMENT_LENGTH_>10 Years 0.00 1.00 0.00 0.00 1.00
HOUSING_no 0.00 0.00 1.00 1.00 0.00
HOUSING_yes 1.00 1.00 0.00 0.00 1.00
In [52]:
df_loan_dummies.to_csv("df_loan_dummies.csv",index=False)
In [53]:
df_loan_dummies.columns
Out[53]:
Index(['LOAN_AMOUNT', 'INTEREST_RATE', 'MONTHLY_PAYMENT', 'ANNUAL_INCOME',
       'DEBT_TO_INCOME', 'DEFAULT', 'LOAN_TERM_ 36 months',
       'LOAN_TERM_ 60 months', 'LOAN_PURPOSE_car', 'LOAN_PURPOSE_credit_card',
       'LOAN_PURPOSE_debt_consolidation', 'LOAN_PURPOSE_educational',
       'LOAN_PURPOSE_home_improvement', 'LOAN_PURPOSE_house',
       'LOAN_PURPOSE_major_purchase', 'LOAN_PURPOSE_medical',
       'LOAN_PURPOSE_moving', 'LOAN_PURPOSE_other',
       'LOAN_PURPOSE_renewable_energy', 'LOAN_PURPOSE_small_business',
       'LOAN_PURPOSE_vacation', 'LOAN_PURPOSE_wedding',
       'EMPLOYMENT_LENGTH_1-2 Years', 'EMPLOYMENT_LENGTH_3-4 Years',
       'EMPLOYMENT_LENGTH_5-6 Years', 'EMPLOYMENT_LENGTH_7-8 Years',
       'EMPLOYMENT_LENGTH_9-10 Years', 'EMPLOYMENT_LENGTH_< 1 year',
       'EMPLOYMENT_LENGTH_>10 Years', 'HOUSING_no', 'HOUSING_yes'],
      dtype='object')

6. Split data to Train & Test sets

In [54]:
X = df_loan_dummies.drop("DEFAULT",axis=1)
y = df_loan_dummies["DEFAULT"]
y=y.astype('int')

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=47, stratify=y)

7. Handling imbalanced data

7.1 Checking the balance

In [55]:
df_loan_dummies["DEFAULT"].value_counts()
Out[55]:
0    22442
1     2557
Name: DEFAULT, dtype: int64
In [56]:
sns.countplot(x="DEFAULT",data=df_loan_dummies)
plt.show()
In [57]:
count_no_default = len(df_loan_dummies[df_loan_dummies['DEFAULT']==0])
count_default = len(df_loan_dummies[df_loan_dummies['DEFAULT']==1])
pct_of_no_default = count_no_default/(count_no_default+count_default)
print("\033[1m percentage of no default is", pct_of_no_default*100)
pct_of_default = count_default/(count_no_default+count_default)
print("\033[1m percentage of default", pct_of_default*100)
 percentage of no default is 89.77159086363454
 percentage of default 10.228409136365455

Our classes are imbalanced, and the ratio of no-default to default instances is 90:10.

7.2 Over-sampling using SMOTE

With our training data created, I’ll up-sample the default using the SMOTE algorithm (Synthetic Minority Oversampling Technique). At a high level, SMOTE:

  1. Works by creating synthetic samples from the minor class (default) instead of creating copies.
  2. Randomly choosing one of the k-nearest-neighbors and using it to create a similar, but randomly tweaked, new observations.

We are going to implement SMOTE in Python.

In [58]:
os = SMOTE(random_state=47)
columns1 = X.columns

os_data_X,os_data_y=os.fit_sample(X_train, y_train)
os_data_X = pd.DataFrame(data=os_data_X,columns=columns1)
os_data_y= pd.DataFrame(data=os_data_y,columns=['DEFAULT'])

# we can Check the numbers of our data
print("length of oversampled data is ",len(os_data_X))
print("Number of no default in oversampled data",len(os_data_y[os_data_y['DEFAULT']==0]))
print("Number of default",len(os_data_y[os_data_y['DEFAULT']==1]))
print("Proportion of no default data in oversampled data is ",len(os_data_y[os_data_y['DEFAULT']==0])/len(os_data_X))
print("Proportion of default data in oversampled data is ",len(os_data_y[os_data_y['DEFAULT']==1])/len(os_data_X))
length of oversampled data is  30072
Number of no default in oversampled data 15036
Number of default 15036
Proportion of no default data in oversampled data is  0.5
Proportion of default data in oversampled data is  0.5
In [59]:
cols=columns1
In [60]:
X_train=os_data_X[cols]
y_train=os_data_y['DEFAULT']
In [61]:
X_train.columns
Out[61]:
Index(['LOAN_AMOUNT', 'INTEREST_RATE', 'MONTHLY_PAYMENT', 'ANNUAL_INCOME',
       'DEBT_TO_INCOME', 'LOAN_TERM_ 36 months', 'LOAN_TERM_ 60 months',
       'LOAN_PURPOSE_car', 'LOAN_PURPOSE_credit_card',
       'LOAN_PURPOSE_debt_consolidation', 'LOAN_PURPOSE_educational',
       'LOAN_PURPOSE_home_improvement', 'LOAN_PURPOSE_house',
       'LOAN_PURPOSE_major_purchase', 'LOAN_PURPOSE_medical',
       'LOAN_PURPOSE_moving', 'LOAN_PURPOSE_other',
       'LOAN_PURPOSE_renewable_energy', 'LOAN_PURPOSE_small_business',
       'LOAN_PURPOSE_vacation', 'LOAN_PURPOSE_wedding',
       'EMPLOYMENT_LENGTH_1-2 Years', 'EMPLOYMENT_LENGTH_3-4 Years',
       'EMPLOYMENT_LENGTH_5-6 Years', 'EMPLOYMENT_LENGTH_7-8 Years',
       'EMPLOYMENT_LENGTH_9-10 Years', 'EMPLOYMENT_LENGTH_< 1 year',
       'EMPLOYMENT_LENGTH_>10 Years', 'HOUSING_no', 'HOUSING_yes'],
      dtype='object')

8. Sanity check between full data and selected data for prediction

In [62]:
ax=pd.value_counts(y_train.values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','orange'])
ax.set_title('DEFAULT')
ax.set_xticklabels(['DEFAULT=True','NO DEFAULT=False'])
plt.show()
In [63]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before balancing LOAN_TERM_ 36 months")
pd.value_counts(df_loan_dummies['LOAN_TERM_ 36 months'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing LOAN_TERM_ 36 months")
pd.value_counts(X_train['LOAN_TERM_ 36 months'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[63]:
<matplotlib.axes._subplots.AxesSubplot at 0x13443c31308>
In [64]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before LOAN_TERM_ 60 months")
pd.value_counts(df_loan_dummies['LOAN_TERM_ 60 months'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing LOAN_TERM_ 60 months")
pd.value_counts(X_train['LOAN_TERM_ 60 months'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[64]:
<matplotlib.axes._subplots.AxesSubplot at 0x1344517db08>
In [65]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before LOAN_PURPOSE_car")
pd.value_counts(df_loan_dummies['LOAN_PURPOSE_car'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing LOAN_PURPOSE_car")
pd.value_counts(X_train['LOAN_PURPOSE_car'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[65]:
<matplotlib.axes._subplots.AxesSubplot at 0x134464185c8>
In [66]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before LOAN_PURPOSE_credit_card")
pd.value_counts(df_loan_dummies['LOAN_PURPOSE_credit_card'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing LOAN_PURPOSE_credit_card")
pd.value_counts(X_train['LOAN_PURPOSE_credit_card'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[66]:
<matplotlib.axes._subplots.AxesSubplot at 0x13445601408>
In [67]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before LOAN_PURPOSE_debt_consolidation")
pd.value_counts(df_loan_dummies['LOAN_PURPOSE_debt_consolidation'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing LOAN_PURPOSE_debt_consolidation")
pd.value_counts(X_train['LOAN_PURPOSE_debt_consolidation'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x134458432c8>
In [68]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before LOAN_PURPOSE_home_improvement")
pd.value_counts(df_loan_dummies['LOAN_PURPOSE_home_improvement'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing LOAN_PURPOSE_home_improvement")
pd.value_counts(X_train['LOAN_PURPOSE_home_improvement'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[68]:
<matplotlib.axes._subplots.AxesSubplot at 0x1344593bbc8>
In [69]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before LOAN_PURPOSE_major_purchase")
pd.value_counts(df_loan_dummies['LOAN_PURPOSE_major_purchase'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing LOAN_PURPOSE_major_purchase")
pd.value_counts(X_train['LOAN_PURPOSE_major_purchase'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[69]:
<matplotlib.axes._subplots.AxesSubplot at 0x13443e9fd88>
In [70]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before LOAN_PURPOSE_medical")
pd.value_counts(df_loan_dummies['LOAN_PURPOSE_medical'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing LOAN_PURPOSE_medical")
pd.value_counts(X_train['LOAN_PURPOSE_medical'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[70]:
<matplotlib.axes._subplots.AxesSubplot at 0x13443f3b2c8>
In [71]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before LOAN_PURPOSE_moving")
pd.value_counts(df_loan_dummies['LOAN_PURPOSE_moving'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing LOAN_PURPOSE_moving")
pd.value_counts(X_train['LOAN_PURPOSE_moving'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[71]:
<matplotlib.axes._subplots.AxesSubplot at 0x13443c80288>
In [72]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before LOAN_PURPOSE_other")
pd.value_counts(df_loan_dummies['LOAN_PURPOSE_other'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing LOAN_PURPOSE_other")
pd.value_counts(X_train['LOAN_PURPOSE_other'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[72]:
<matplotlib.axes._subplots.AxesSubplot at 0x13441c5a908>
In [73]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before LOAN_PURPOSE_vacation")
pd.value_counts(df_loan_dummies['LOAN_PURPOSE_vacation'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing LOAN_PURPOSE_vacation")
pd.value_counts(X_train['LOAN_PURPOSE_vacation'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[73]:
<matplotlib.axes._subplots.AxesSubplot at 0x134451f4148>
In [74]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before EMPLOYMENT_LENGTH_1-2 Years")
pd.value_counts(df_loan_dummies['EMPLOYMENT_LENGTH_1-2 Years'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing EMPLOYMENT_LENGTH_1-2 Years")
pd.value_counts(X_train['EMPLOYMENT_LENGTH_1-2 Years'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[74]:
<matplotlib.axes._subplots.AxesSubplot at 0x13441a3dbc8>
In [75]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before EMPLOYMENT_LENGTH_3-4 Years")
pd.value_counts(df_loan_dummies['EMPLOYMENT_LENGTH_3-4 Years'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing EMPLOYMENT_LENGTH_3-4 Years")
pd.value_counts(X_train['EMPLOYMENT_LENGTH_3-4 Years'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[75]:
<matplotlib.axes._subplots.AxesSubplot at 0x13443d1c608>
In [76]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before EMPLOYMENT_LENGTH_5-6 Years")
pd.value_counts(df_loan_dummies['EMPLOYMENT_LENGTH_5-6 Years'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing EMPLOYMENT_LENGTH_5-6 Years")
pd.value_counts(X_train['EMPLOYMENT_LENGTH_5-6 Years'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[76]:
<matplotlib.axes._subplots.AxesSubplot at 0x134441a4208>
In [77]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before EMPLOYMENT_LENGTH_7-8 Years")
pd.value_counts(df_loan_dummies['EMPLOYMENT_LENGTH_7-8 Years'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing EMPLOYMENT_LENGTH_7-8 Years")
pd.value_counts(X_train['EMPLOYMENT_LENGTH_7-8 Years'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[77]:
<matplotlib.axes._subplots.AxesSubplot at 0x13444205ac8>
In [78]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before EMPLOYMENT_LENGTH_9-10 Years")
pd.value_counts(df_loan_dummies['EMPLOYMENT_LENGTH_9-10 Years'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing EMPLOYMENT_LENGTH_9-10 Years")
pd.value_counts(X_train['EMPLOYMENT_LENGTH_9-10 Years'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[78]:
<matplotlib.axes._subplots.AxesSubplot at 0x134456fd888>
In [79]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before EMPLOYMENT_LENGTH_< 1 year")
pd.value_counts(df_loan_dummies['EMPLOYMENT_LENGTH_< 1 year'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing EMPLOYMENT_LENGTH_< 1 year")
pd.value_counts(X_train['EMPLOYMENT_LENGTH_< 1 year'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x134440d9ec8>
In [80]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before EMPLOYMENT_LENGTH_>10 Years")
pd.value_counts(df_loan_dummies['EMPLOYMENT_LENGTH_>10 Years'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing EMPLOYMENT_LENGTH_>10 Years")
pd.value_counts(X_train['EMPLOYMENT_LENGTH_>10 Years'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0x134442e0888>
In [81]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before HOUSING_yes")
pd.value_counts(df_loan_dummies['HOUSING_yes'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing HOUSING_yes")
pd.value_counts(X_train['HOUSING_yes'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x134447a5288>
In [82]:
plt.figure(figsize=(25,25))
plt.subplot(1, 2, 1)
plt.title("Data before HOUSING_no")
pd.value_counts(df_loan_dummies['HOUSING_no'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
plt.subplot(1, 2, 2)
plt.title("Data after balancing HOUSING_no")
pd.value_counts(X_train['HOUSING_no'].values,normalize=True).plot.bar(figsize=(12, 5),color=['blue','darkorange'])
Out[82]:
<matplotlib.axes._subplots.AxesSubplot at 0x13441f31fc8>
In [83]:
relevantColumns=['LOAN_TERM_ 36 months', 'LOAN_TERM_ 60 months', 'LOAN_PURPOSE_car', 'LOAN_PURPOSE_credit_card', 'LOAN_PURPOSE_debt_consolidation', 'LOAN_PURPOSE_home_improvement', 'LOAN_PURPOSE_major_purchase', 'LOAN_PURPOSE_medical', 'LOAN_PURPOSE_moving', 'LOAN_PURPOSE_other', 'LOAN_PURPOSE_vacation', 'EMPLOYMENT_LENGTH_1-2 Years', 'EMPLOYMENT_LENGTH_3-4 Years', 'EMPLOYMENT_LENGTH_5-6 Years', 'EMPLOYMENT_LENGTH_7-8 Years', 'EMPLOYMENT_LENGTH_9-10 Years', 'EMPLOYMENT_LENGTH_< 1 year', 'EMPLOYMENT_LENGTH_>10 Years', 'HOUSING_no', 'HOUSING_yes','DEFAULT']
In [84]:
plt.figure(figsize=(10,10))
sns.set_style()
corr = df_loan_dummies[relevantColumns].corr()
sns.heatmap(corr,cmap="RdYlBu",vmin=-1,vmax=1)
plt.title("correlation heat map")
Out[84]:
Text(0.5, 1, 'correlation heat map')

9. Training & Evaluating different machine learning classification models

9.1. K-Nearest Neighbors

We will create a SyntheticGridSearch for the K-Nearest Neighbors Classifier in order to find the best hyperparameter for this algorithm

In [85]:
knn_roc = []
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import roc_auc_score
for i in range (1,21):
    knn = KNeighborsClassifier(n_neighbors=i)
    knn.fit(X_train, y_train)
    y_pred1 = knn.predict(X_test)
    roc1 = roc_auc_score(y_test, y_pred1)
    print(i, roc1)
    knn_roc.append(roc1)
print(max(knn_roc))
1 0.5350086006862631
2 0.5245970668076223
3 0.5322066583646152
4 0.5340383037706073
5 0.5374302634088154
6 0.5306068283305583
7 0.5322975287105497
8 0.5295754179076015
9 0.5340189458271953
10 0.533501080845171
11 0.537691515653377
12 0.5369103186477469
13 0.534374908009773
14 0.5338029687725976
15 0.5363833346345284
16 0.5316711952522164
17 0.5384440757013974
18 0.533897358744607
19 0.5359680187576872
20 0.5354551132487685
0.5384440757013974

We will refit an estimator using the best found parameter {'n_neighbors': 17}

9.1.1 Training a Predictive Model

In [86]:
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=17)
knn.fit(X_train, y_train)
Out[86]:
KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                     metric_params=None, n_jobs=None, n_neighbors=17, p=2,
                     weights='uniform')

9.1.2 Evaluating the Model Accuracy

In [87]:
print('\n accuracy:')
print('---------------')
acc1=knn.score(X_test, y_test)
print(acc1)
 accuracy:
---------------
0.6378181818181818

9.1.3 Confusion Matrix

In [88]:
y_pred1 = knn.predict(X_test)
from sklearn.metrics import confusion_matrix
cm1 = confusion_matrix(y_true=y_test, y_pred=y_pred1)
cmDf1=pd.DataFrame(cm1, index=knn.classes_, columns=knn.classes_)
print('\nConfusion Matrix')
print('-------------------')
print(cmDf1)
Confusion Matrix
-------------------
      0     1
0  4913  2493
1   495   349
In [89]:
print("\033[1m The result is telling us that we have: ",(cm1[0,0]+cm1[1,1]),"correct predictions.")
print("\033[1m The result is telling us that we have: ",(cm1[0,1]+cm1[1,0]),"incorrect predictions.")
print("\033[1m We have a total predictions of: ",(cm1.sum()))
 The result is telling us that we have:  5262 correct predictions.
 The result is telling us that we have:  2988 incorrect predictions.
 We have a total predictions of:  8250

9.1.4 Classification Report

In [90]:
from sklearn.metrics import classification_report
print('\nclassification_report')
print('------------------------')
print(classification_report(y_test, y_pred1))
classification_report
------------------------
              precision    recall  f1-score   support

           0       0.91      0.66      0.77      7406
           1       0.12      0.41      0.19       844

    accuracy                           0.64      8250
   macro avg       0.52      0.54      0.48      8250
weighted avg       0.83      0.64      0.71      8250

9.1.5 Precision and Recall

In [91]:
per1=metrics.precision_score(y_test, y_pred1)
rec1=metrics.recall_score(y_test, y_pred1)
print("\033[1m Precision of the model:", "{:.2%}".format(per1))
print("\033[1m Recall of the model:", "{:.2%}".format(rec1))
 Precision of the model: 12.28%
 Recall of the model: 41.35%

9.1.6 The ROC Curve

In [92]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
roc1 = roc_auc_score(y_test, y_pred1)
fpr, tpr, thresholds = roc_curve(y_test, knn.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='K-Nearest Neighbors (area = %0.4f)' % roc1)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

9.1.7 ROC AUC

In [93]:
print("\033[1m The ROC AUC score using the K-Nearest Neighbors algorithm is:", "{:.4%}".format(roc1))
 The ROC AUC score using the K-Nearest Neighbors algorithm is: 53.8444%

9.2. Logistic Regression

We will create a SyntheticGridSearch for the Logistic Regression in order to find the best hyperparameter for this algorithm

In [94]:
LR_roc = []
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
C = [0.001, 0.01, 0.1, 1, 10, 100, 1000]
for i in C:
    logreg = LogisticRegression(C=i, random_state=47)
    logreg.fit(X_train, y_train)
    y_pred1 = logreg.predict(X_test)
    roc1 = roc_auc_score(y_test, y_pred1)
    print(i, roc1)
    LR_roc.append(roc1)
print(max(LR_roc))
0.001 0.6037150613118862
0.01 0.6090772116370355
0.1 0.6091447244644729
1 0.6091447244644729
10 0.6091447244644729
100 0.6091447244644729
1000 0.6091447244644729
0.6091447244644729

We will refit an estimator using the best found parameter {'C': 1}

9.2.1 Training a Predictive Model

In [95]:
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression(C=1, random_state=47)
logreg.fit(X_train, y_train)
Out[95]:
LogisticRegression(C=1, class_weight=None, dual=False, fit_intercept=True,
                   intercept_scaling=1, l1_ratio=None, max_iter=100,
                   multi_class='auto', n_jobs=None, penalty='l2',
                   random_state=47, solver='lbfgs', tol=0.0001, verbose=0,
                   warm_start=False)

9.2.2 Evaluating the Model Accuracy

In [96]:
print('\n accuracy:')
print('---------------')
acc2=logreg.score(X_test, y_test)
print(acc2)
 accuracy:
---------------
0.5103030303030303

9.2.3 Confusion Matrix

In [97]:
y_pred2 = logreg.predict(X_test)
from sklearn.metrics import confusion_matrix
cm2 = confusion_matrix(y_true=y_test, y_pred=y_pred2)
cmDf2=pd.DataFrame(cm2, index=logreg.classes_, columns=logreg.classes_)
print('\nConfusion Matrix')
print('-------------------')
print(cmDf2)
Confusion Matrix
-------------------
      0     1
0  3591  3815
1   225   619
In [98]:
print("\033[1m The result is telling us that we have: ",(cm2[0,0]+cm2[1,1]),"correct predictions.")
print("\033[1m The result is telling us that we have: ",(cm2[0,1]+cm2[1,0]),"incorrect predictions.")
print("\033[1m We have a total predictions of: ",(cm2.sum()))
 The result is telling us that we have:  4210 correct predictions.
 The result is telling us that we have:  4040 incorrect predictions.
 We have a total predictions of:  8250

9.2.4 Classification Report

In [99]:
from sklearn.metrics import classification_report
print('\nclassification_report')
print('------------------------')
print(classification_report(y_test, y_pred2))
classification_report
------------------------
              precision    recall  f1-score   support

           0       0.94      0.48      0.64      7406
           1       0.14      0.73      0.23       844

    accuracy                           0.51      8250
   macro avg       0.54      0.61      0.44      8250
weighted avg       0.86      0.51      0.60      8250

9.2.5 Precision and Recall

In [100]:
per2=metrics.precision_score(y_test, y_pred2)
rec2=metrics.recall_score(y_test, y_pred2)
print("\033[1m Precision of the model:", "{:.2%}".format(per2))
print("\033[1m Recall of the model:", "{:.2%}".format(rec2))
 Precision of the model: 13.96%
 Recall of the model: 73.34%

9.2.6 The ROC Curve

In [101]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
roc2 = roc_auc_score(y_test, y_pred2)
fpr, tpr, thresholds = roc_curve(y_test, logreg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Logistic Regression (area = %0.4f)' % roc2)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

9.2.7 ROC AUC

In [102]:
print("\033[1m The ROC AUC score using the Logistic Regression algorithm is:", "{:.4%}".format(roc2))
 The ROC AUC score using the Logistic Regression algorithm is: 60.9145%

9.3. Support Vector Machine

We have created a GridSearch for the Support Vector Machine in order to find the best hyperparameters for this algorithm. We have found estimator using the best found parameters. {'C': 0.1, 'gamma':0.001}

9.3.1 Training a Predictive Model

In [103]:
from sklearn.svm import SVC
svc = SVC(probability=True, C=0.1, gamma=0.001, random_state=47)
svc.fit(X_train, y_train)
Out[103]:
SVC(C=0.1, break_ties=False, cache_size=200, class_weight=None, coef0=0.0,
    decision_function_shape='ovr', degree=3, gamma=0.001, kernel='rbf',
    max_iter=-1, probability=True, random_state=47, shrinking=True, tol=0.001,
    verbose=False)

9.3.2 Evaluating the Model Accuracy

In [104]:
print('\n accuracy:')
print('---------------')
acc3=svc.score(X_test, y_test)
print(acc3)
 accuracy:
---------------
0.4450909090909091

9.3.3 Confusion Matrix

In [105]:
y_pred3 = svc.predict(X_test)
from sklearn.metrics import confusion_matrix
cm3 = confusion_matrix(y_true=y_test, y_pred=y_pred3)
cmDf3=pd.DataFrame(cm3, index=svc.classes_, columns=svc.classes_)
print('\nConfusion Matrix')
print('-------------------')
print(cmDf3)
Confusion Matrix
-------------------
      0     1
0  3121  4285
1   293   551
In [106]:
print("\033[1m The result is telling us that we have: ",(cm3[0,0]+cm3[1,1]),"correct predictions.")
print("\033[1m The result is telling us that we have: ",(cm3[0,1]+cm3[1,0]),"incorrect predictions.")
print("\033[1m We have a total predictions of: ",(cm3.sum()))
 The result is telling us that we have:  3672 correct predictions.
 The result is telling us that we have:  4578 incorrect predictions.
 We have a total predictions of:  8250

9.3.4 Classification Report

In [107]:
from sklearn.metrics import classification_report
print('\nclassification_report')
print('------------------------')
print(classification_report(y_test, y_pred3))
classification_report
------------------------
              precision    recall  f1-score   support

           0       0.91      0.42      0.58      7406
           1       0.11      0.65      0.19       844

    accuracy                           0.45      8250
   macro avg       0.51      0.54      0.39      8250
weighted avg       0.83      0.45      0.54      8250

9.3.5 Precision and Recall

In [108]:
per3=metrics.precision_score(y_test, y_pred3)
rec3=metrics.recall_score(y_test, y_pred3)
print("\033[1m Precision of the model:", "{:.2%}".format(per3))
print("\033[1m Recall of the model:", "{:.2%}".format(rec3))
 Precision of the model: 11.39%
 Recall of the model: 65.28%

9.3.6 The ROC Curve

In [109]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
roc3 = roc_auc_score(y_test, y_pred3)
fpr, tpr, thresholds = roc_curve(y_test, svc.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Support Vector Machine (area = %0.4f)' % roc3)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

9.3.7 ROC AUC

In [110]:
print("\033[1m The ROC AUC score using the Support Vector Machine algorithm is:", "{:.4%}".format(roc3))
 The ROC AUC score using the Support Vector Machine algorithm is: 53.7129%

9.4. Decision Trees

We will create a SyntheticGridSearch for the Decision Trees in order to find the best hyperparameter of this algorithm

In [111]:
DT_roc = []
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
for i in range (1,21):
    dt = DecisionTreeClassifier(max_depth=i, random_state=47)
    dt.fit(X_train, y_train)
    y_pred1 = dt.predict(X_test)
    roc1 = roc_auc_score(y_test, y_pred1)
    print(i, roc1)
    DT_roc.append(roc1)
print(max(DT_roc))
1 0.6226568889321198
2 0.5701432679792098
3 0.5371930086147647
4 0.5485415629443527
5 0.5767849623655983
6 0.580293869579296
7 0.5492842040461621
8 0.5427392993768343
9 0.5276708202520565
10 0.5130530132478726
11 0.5197922972663384
12 0.5176842012304612
13 0.5160723724711487
14 0.5128755920970955
15 0.5117869077589197
16 0.5188266398577815
17 0.5183017356236074
18 0.5227017481662748
19 0.5189514266004379
20 0.517091464202843
0.6226568889321198

We will refit an estimator using the best found parameter {'max_depth': 1}

9.4.1 Training a Predictive Model

In [112]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import f1_score
dt = DecisionTreeClassifier(max_depth=1, random_state=47)
dt.fit(X_train, y_train)
Out[112]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=1, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=47, splitter='best')

9.4.2 Evaluating the Model Accuracy

In [113]:
print('\n accuracy:')
print('---------------')
acc4=dt.score(X_test, y_test)
print(acc4)
 accuracy:
---------------
0.4723636363636364

9.4.3 Confusion Matrix

In [114]:
y_pred4 = dt.predict(X_test)
from sklearn.metrics import confusion_matrix
cm4 = confusion_matrix(y_true=y_test, y_pred=y_pred4)
cmDf4=pd.DataFrame(cm4, index=dt.classes_, columns=dt.classes_)
print('\nConfusion Matrix')
print('-------------------')
print(cmDf4)
Confusion Matrix
-------------------
      0     1
0  3212  4194
1   159   685
In [115]:
print("\033[1m The result is telling us that we have: ",(cm4[0,0]+cm4[1,1]),"correct predictions.")
print("\033[1m The result is telling us that we have: ",(cm4[0,1]+cm4[1,0]),"incorrect predictions.")
print("\033[1m We have a total predictions of: ",(cm4.sum()))
 The result is telling us that we have:  3897 correct predictions.
 The result is telling us that we have:  4353 incorrect predictions.
 We have a total predictions of:  8250

9.4.4 Classification Report

In [116]:
from sklearn.metrics import classification_report
print('\nclassification_report')
print('------------------------')
print(classification_report(y_test, y_pred4))
classification_report
------------------------
              precision    recall  f1-score   support

           0       0.95      0.43      0.60      7406
           1       0.14      0.81      0.24       844

    accuracy                           0.47      8250
   macro avg       0.55      0.62      0.42      8250
weighted avg       0.87      0.47      0.56      8250

9.4.5 Precision and Recall

In [117]:
per4=metrics.precision_score(y_test, y_pred4)
rec4=metrics.recall_score(y_test, y_pred4)
print("\033[1m Precision of the model:", "{:.2%}".format(per4))
print("\033[1m Recall of the model:", "{:.2%}".format(rec4))
 Precision of the model: 14.04%
 Recall of the model: 81.16%

9.4.6 The ROC Curve

In [118]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
roc4 = roc_auc_score(y_test, y_pred4)
fpr, tpr, thresholds = roc_curve(y_test, dt.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Decision Trees (area = %0.4f)' % roc4)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

9.4.7 ROC AUC

In [119]:
print("\033[1m The ROC AUC score using the Decision Trees algorithm is:", "{:.4%}".format(roc4))
 The ROC AUC score using the Decision Trees algorithm is: 62.2657%

9.5. Random Forest

We will create a SyntheticGridSearch for the Random Forest in order to find the best hyperparameter of this algorithm

In [120]:
RF_roc = []
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
for i in range (1,21):
    rf = RandomForestClassifier(max_depth=i, random_state=47)
    rf.fit(X_train, y_train)
    y_pred1 = rf.predict(X_test)
    roc1 = roc_auc_score(y_test, y_pred1)
    print(i, roc1)
    RF_roc.append(roc1)
print(max(RF_roc))
1 0.5978833608717411
2 0.5923999434300099
3 0.5827377699393216
4 0.5766672148750916
5 0.5565539917039213
6 0.5500207657938422
7 0.5220339791100593
8 0.5139444385428492
9 0.5182871771702975
10 0.5131427637127831
11 0.5121689471710527
12 0.5122060632278427
13 0.5107663761801947
14 0.5076844636025869
15 0.5120862359582917
16 0.5100018174069186
17 0.5078042908721379
18 0.5111039403173807
19 0.5085317335886236
20 0.5098515933667207
0.5978833608717411

We will refit an estimator using the best found parameter {'max_depth': 1}

9.5.1 Training a Predictive Model

In [121]:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(max_depth=1, random_state=47)
rf.fit(X_train, y_train)
Out[121]:
RandomForestClassifier(bootstrap=True, ccp_alpha=0.0, class_weight=None,
                       criterion='gini', max_depth=1, max_features='auto',
                       max_leaf_nodes=None, max_samples=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, n_estimators=100,
                       n_jobs=None, oob_score=False, random_state=47, verbose=0,
                       warm_start=False)

9.5.2 Evaluating the Model Accuracy

In [122]:
print('\n accuracy:')
print('---------------')
acc5=rf.score(X_test, y_test)
print(acc5)
 accuracy:
---------------
0.6625454545454545

9.5.3 Confusion Matrix

In [123]:
y_pred5 = rf.predict(X_test)
from sklearn.metrics import confusion_matrix
cm5 = confusion_matrix(y_true=y_test, y_pred=y_pred5)
cmDf5=pd.DataFrame(cm5, index=rf.classes_, columns=rf.classes_)
print('\nConfusion Matrix')
print('-------------------')
print(cmDf5)
Confusion Matrix
-------------------
      0     1
0  5030  2376
1   408   436
In [124]:
print("\033[1m The result is telling us that we have: ",(cm5[0,0]+cm5[1,1]),"correct predictions.")
print("\033[1m The result is telling us that we have: ",(cm5[0,1]+cm5[1,0]),"incorrect predictions.")
print("\033[1m We have a total predictions of: ",(cm5.sum()))
 The result is telling us that we have:  5466 correct predictions.
 The result is telling us that we have:  2784 incorrect predictions.
 We have a total predictions of:  8250

9.5.4 Classification Report

In [125]:
from sklearn.metrics import classification_report
print('\nclassification_report')
print('------------------------')
print(classification_report(y_test, y_pred5))
classification_report
------------------------
              precision    recall  f1-score   support

           0       0.92      0.68      0.78      7406
           1       0.16      0.52      0.24       844

    accuracy                           0.66      8250
   macro avg       0.54      0.60      0.51      8250
weighted avg       0.85      0.66      0.73      8250

9.5.5 Precision and Recall

In [126]:
per5=metrics.precision_score(y_test, y_pred5)
rec5=metrics.recall_score(y_test, y_pred5)
print("\033[1m Precision of the model:", "{:.2%}".format(per5))
print("\033[1m Recall of the model:", "{:.2%}".format(rec5))
 Precision of the model: 15.50%
 Recall of the model: 51.66%

9.5.6 The ROC Curve

In [127]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
roc5 = roc_auc_score(y_test, y_pred5)
fpr, tpr, thresholds = roc_curve(y_test, rf.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Random Forest (area = %0.4f)' % roc5)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

9.5.7 ROC AUC

In [128]:
print("\033[1m The ROC AUC score using the Random Forest algorithm is:", "{:.4%}".format(roc5))
 The ROC AUC score using the Random Forest algorithm is: 59.7883%

9.6. Naïve Bayes

9.6.1 Training a Predictive Model

In [129]:
import sklearn.naive_bayes as nb
nbc = nb.GaussianNB()
nbc.fit(X_train, y_train)
Out[129]:
GaussianNB(priors=None, var_smoothing=1e-09)

9.6.2 Evaluating the Model Accuracy

In [130]:
print('\n accuracy:')
print('---------------')
acc6=nbc.score(X_test, y_test)
print(acc6)
 accuracy:
---------------
0.5151515151515151

9.6.3 Confusion Matrix

In [131]:
y_pred6 = nbc.predict(X_test)
from sklearn.metrics import confusion_matrix
cm6 = confusion_matrix(y_true=y_test, y_pred=y_pred6)
cmDf6=pd.DataFrame(cm6, index=nbc.classes_, columns=nbc.classes_)
print('\nConfusion Matrix')
print('-------------------')
print(cmDf6)
Confusion Matrix
-------------------
      0     1
0  3629  3777
1   223   621
In [132]:
print("\033[1m The result is telling us that we have: ",(cm6[0,0]+cm6[1,1]),"correct predictions.")
print("\033[1m The result is telling us that we have: ",(cm6[0,1]+cm6[1,0]),"incorrect predictions.")
print("\033[1m We have a total predictions of: ",(cm6.sum()))
 The result is telling us that we have:  4250 correct predictions.
 The result is telling us that we have:  4000 incorrect predictions.
 We have a total predictions of:  8250

9.6.4 Classification Report

In [133]:
from sklearn.metrics import classification_report
print('\nclassification_report')
print('------------------------')
print(classification_report(y_test, y_pred6))
classification_report
------------------------
              precision    recall  f1-score   support

           0       0.94      0.49      0.64      7406
           1       0.14      0.74      0.24       844

    accuracy                           0.52      8250
   macro avg       0.54      0.61      0.44      8250
weighted avg       0.86      0.52      0.60      8250

9.6.5 Precision and Recall

In [134]:
per6=metrics.precision_score(y_test, y_pred6)
rec6=metrics.recall_score(y_test, y_pred6)
print("\033[1m Precision of the model:", "{:.2%}".format(per6))
print("\033[1m Recall of the model:", "{:.2%}".format(rec6))
 Precision of the model: 14.12%
 Recall of the model: 73.58%

9.6.6 The ROC Curve

In [135]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
roc6 = roc_auc_score(y_test, y_pred6)
fpr, tpr, thresholds = roc_curve(y_test, nbc.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Naïve Bayes (area = %0.4f)' % roc6)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

9.6.7 ROC AUC

In [136]:
print("\033[1m The ROC AUC score using the Naïve Bayes algorithm is:", "{:.4%}".format(roc6))
 The ROC AUC score using the Naïve Bayes algorithm is: 61.2895%

9.7. Extra Trees

We will create a SyntheticGridSearch for the Extra Trees in order to find the best hyperparameter of this algorithm

In [137]:
ET_roc = []
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score
for i in range (1,21):
    et = ExtraTreesClassifier(max_depth=i, random_state=47)
    et.fit(X_train, y_train)
    y_pred1 = et.predict(X_test)
    roc1 = roc_auc_score(y_test, y_pred1)
    print(i, roc1)
    ET_roc.append(roc1)
print(max(ET_roc))
1 0.554147687349696
2 0.5624442459233131
3 0.5621759544266017
4 0.5319647640634658
5 0.5191917210715534
6 0.5103899361731811
7 0.5072489898673165
8 0.5039207034644639
9 0.5017839704709771
10 0.5013417774495638
11 0.5006143347330779
12 0.501746854414187
13 0.5018666816837379
14 0.5005468219056407
15 0.49961684070684326
16 0.50171645764354
17 0.49978898241850783
18 0.5009738165417306
19 0.5010261309838443
20 0.5022632795491807
0.5624442459233131

We will refit an estimator using the best found parameter {'max_depth': 2}

9.7.1 Training a Predictive Model

In [138]:
from sklearn.ensemble import ExtraTreesClassifier
et = ExtraTreesClassifier(max_depth=2, random_state=47)
et.fit(X_train, y_train)
Out[138]:
ExtraTreesClassifier(bootstrap=False, ccp_alpha=0.0, class_weight=None,
                     criterion='gini', max_depth=2, max_features='auto',
                     max_leaf_nodes=None, max_samples=None,
                     min_impurity_decrease=0.0, min_impurity_split=None,
                     min_samples_leaf=1, min_samples_split=2,
                     min_weight_fraction_leaf=0.0, n_estimators=100,
                     n_jobs=None, oob_score=False, random_state=47, verbose=0,
                     warm_start=False)

9.7.2 Evaluating the Model Accuracy

In [139]:
print('\n accuracy:')
print('---------------')
acc7=et.score(X_test, y_test)
print(acc7)
 accuracy:
---------------
0.8043636363636364

9.7.3 Confusion Matrix

In [140]:
y_pred7 = et.predict(X_test)
from sklearn.metrics import confusion_matrix
cm7 = confusion_matrix(y_true=y_test, y_pred=y_pred7)
cmDf7=pd.DataFrame(cm7, index=et.classes_, columns=et.classes_)
print('\nConfusion Matrix')
print('-------------------')
print(cmDf7)
Confusion Matrix
-------------------
      0    1
0  6418  988
1   626  218
In [141]:
print("\033[1m The result is telling us that we have: ",(cm7[0,0]+cm7[1,1]),"correct predictions.")
print("\033[1m The result is telling us that we have: ",(cm7[0,1]+cm7[1,0]),"incorrect predictions.")
print("\033[1m We have a total predictions of: ",(cm7.sum()))
 The result is telling us that we have:  6636 correct predictions.
 The result is telling us that we have:  1614 incorrect predictions.
 We have a total predictions of:  8250

9.7.4 Classification Report

In [142]:
from sklearn.metrics import classification_report
print('\nclassification_report')
print('------------------------')
print(classification_report(y_test, y_pred7))
classification_report
------------------------
              precision    recall  f1-score   support

           0       0.91      0.87      0.89      7406
           1       0.18      0.26      0.21       844

    accuracy                           0.80      8250
   macro avg       0.55      0.56      0.55      8250
weighted avg       0.84      0.80      0.82      8250

9.7.5 Precision and Recall

In [143]:
per7=metrics.precision_score(y_test, y_pred7)
rec7=metrics.recall_score(y_test, y_pred7)
print("\033[1m Precision of the model:", "{:.2%}".format(per7))
print("\033[1m Recall of the model:", "{:.2%}".format(rec7))
 Precision of the model: 18.08%
 Recall of the model: 25.83%

9.7.6 The ROC Curve

In [144]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
roc7 = roc_auc_score(y_test, y_pred7)
fpr, tpr, thresholds = roc_curve(y_test, et.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Extra Trees (area = %0.4f)' % roc7)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

9.7.7 ROC AUC

In [145]:
print("\033[1m The ROC AUC score using the Extra Trees Classifier algorithm is:", "{:.4%}".format(roc7))
 The ROC AUC score using the Extra Trees Classifier algorithm is: 56.2444%

9.8. Gradient Boosting

We will create a SyntheticGridSearch for the Gradient Boosting in order to find the best hyperparameter of this algorithm

In [146]:
GB_roc = []
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
for i in range (1,21):
    gb = GradientBoostingClassifier(max_depth=i, random_state=47)
    gb.fit(X_train, y_train)
    y_pred1 = gb.predict(X_test)
    roc1 = roc_auc_score(y_test, y_pred1)
    print(i, roc1)
    GB_roc.append(roc1)
print(max(GB_roc))
1 0.5263968755959367
2 0.5033434847881761
3 0.5023088747051514
4 0.5040338114478717
5 0.5051207359730102
6 0.5056827562639745
7 0.5102414719460204
8 0.5097840805392835
9 0.5096490548844091
10 0.5132710700815145
11 0.5098887094235109
12 0.5108558066790985
13 0.5102633896174871
14 0.5136001551195201
15 0.5101502816340793
16 0.5141688946966274
17 0.5117755489656779
18 0.5119222533798009
19 0.5094511559091962
20 0.5195187263305145
0.5263968755959367

We will refit an estimator using the best found parameter {'max_depth': 1}

9.8.1 Training a Predictive Model

In [147]:
from sklearn.ensemble import GradientBoostingClassifier
gb = GradientBoostingClassifier(max_depth=1, random_state=47)
gb.fit(X_train, y_train)
Out[147]:
GradientBoostingClassifier(ccp_alpha=0.0, criterion='friedman_mse', init=None,
                           learning_rate=0.1, loss='deviance', max_depth=1,
                           max_features=None, max_leaf_nodes=None,
                           min_impurity_decrease=0.0, min_impurity_split=None,
                           min_samples_leaf=1, min_samples_split=2,
                           min_weight_fraction_leaf=0.0, n_estimators=100,
                           n_iter_no_change=None, presort='deprecated',
                           random_state=47, subsample=1.0, tol=0.0001,
                           validation_fraction=0.1, verbose=0,
                           warm_start=False)

9.8.2 Evaluating the Model Accuracy

In [148]:
print('\n accuracy:')
print('---------------')
acc8=gb.score(X_test, y_test)
print(acc8)
 accuracy:
---------------
0.8696969696969697

9.8.3 Confusion Matrix

In [149]:
y_pred8 = gb.predict(X_test)
from sklearn.metrics import confusion_matrix
cm8 = confusion_matrix(y_true=y_test, y_pred=y_pred8)
cmDf8=pd.DataFrame(cm8, index=gb.classes_, columns=gb.classes_)
print('\nConfusion Matrix')
print('-------------------')
print(cmDf8)
Confusion Matrix
-------------------
      0    1
0  7095  311
1   764   80
In [150]:
print("\033[1m The result is telling us that we have: ",(cm8[0,0]+cm8[1,1]),"correct predictions.")
print("\033[1m The result is telling us that we have: ",(cm8[0,1]+cm8[1,0]),"incorrect predictions.")
print("\033[1m We have a total predictions of: ",(cm8.sum()))
 The result is telling us that we have:  7175 correct predictions.
 The result is telling us that we have:  1075 incorrect predictions.
 We have a total predictions of:  8250

9.8.4 Classification Report

In [151]:
from sklearn.metrics import classification_report
print('\nclassification_report')
print('------------------------')
print(classification_report(y_test, y_pred8))
classification_report
------------------------
              precision    recall  f1-score   support

           0       0.90      0.96      0.93      7406
           1       0.20      0.09      0.13       844

    accuracy                           0.87      8250
   macro avg       0.55      0.53      0.53      8250
weighted avg       0.83      0.87      0.85      8250

9.8.5 Precision and Recall

In [152]:
per8=metrics.precision_score(y_test, y_pred8)
rec8=metrics.recall_score(y_test, y_pred8)
print("\033[1m Precision of the model:", "{:.2%}".format(per8))
print("\033[1m Recall of the model:", "{:.2%}".format(rec8))
 Precision of the model: 20.46%
 Recall of the model: 9.48%

9.8.6 The ROC Curve

In [153]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
roc8 = roc_auc_score(y_test, y_pred8)
fpr, tpr, thresholds = roc_curve(y_test, gb.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Gradient Boosting (area = %0.4f)' % roc8)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

9.8.7 ROC AUC

In [154]:
print("\033[1m The ROC AUC score using the Gradient Boosting Classifier algorithm is:", "{:.4%}".format(roc8))
 The ROC AUC score using the Gradient Boosting Classifier algorithm is: 52.6397%

9.9. AdaBoost

We will create a SyntheticGridSearch for the AdaBoost in order to find the best hyperparameter of this algorithm

In [155]:
Ada_roc = []
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import roc_auc_score
num_estimators = [1, 10, 20, 40, 60, 80, 100]
for i in num_estimators:
    ad = AdaBoostClassifier(n_estimators=i, random_state=47)
    ad.fit(X_train, y_train)
    y_pred1 = ad.predict(X_test)
    roc1 = roc_auc_score(y_test, y_pred1)
    print(i, roc1)
    Ada_roc.append(roc1)
print(max(Ada_roc))
1 0.6226568889321198
10 0.5319942009360925
20 0.5044385684464883
40 0.5075543974208181
60 0.4992573588981907
80 0.5011544373525757
100 0.5012219501800129
0.6226568889321198

We will refit an estimator using the best found parameter {'n_estimators': 1}

9.9.1 Training a Predictive Model

In [156]:
from sklearn.ensemble import AdaBoostClassifier
ad = AdaBoostClassifier(n_estimators=1, random_state=47)
ad.fit(X_train, y_train)
Out[156]:
AdaBoostClassifier(algorithm='SAMME.R', base_estimator=None, learning_rate=1.0,
                   n_estimators=1, random_state=47)

9.9.2 Evaluating the Model Accuracy

In [157]:
print('\n accuracy:')
print('---------------')
acc9=ad.score(X_test, y_test)
print(acc9)
 accuracy:
---------------
0.4723636363636364

9.9.3 Confusion Matrix

In [158]:
y_pred9 = ad.predict(X_test)
from sklearn.metrics import confusion_matrix
cm9 = confusion_matrix(y_true=y_test, y_pred=y_pred9)
cmDf9=pd.DataFrame(cm9, index=ad.classes_, columns=ad.classes_)
print('\nConfusion Matrix')
print('-------------------')
print(cmDf9)
Confusion Matrix
-------------------
      0     1
0  3212  4194
1   159   685
In [159]:
print("\033[1m The result is telling us that we have: ",(cm9[0,0]+cm9[1,1]),"correct predictions.")
print("\033[1m The result is telling us that we have: ",(cm9[0,1]+cm9[1,0]),"incorrect predictions.")
print("\033[1m We have a total predictions of: ",(cm9.sum()))
 The result is telling us that we have:  3897 correct predictions.
 The result is telling us that we have:  4353 incorrect predictions.
 We have a total predictions of:  8250

9.9.4 Classification Report

In [160]:
from sklearn.metrics import classification_report
print('\nclassification_report')
print('------------------------')
print(classification_report(y_test, y_pred9))
classification_report
------------------------
              precision    recall  f1-score   support

           0       0.95      0.43      0.60      7406
           1       0.14      0.81      0.24       844

    accuracy                           0.47      8250
   macro avg       0.55      0.62      0.42      8250
weighted avg       0.87      0.47      0.56      8250

9.9.5 Precision and Recall

In [161]:
per9=metrics.precision_score(y_test, y_pred9)
rec9=metrics.recall_score(y_test, y_pred9)
print("\033[1m Precision of the model:", "{:.2%}".format(per9))
print("\033[1m Recall of the model:", "{:.2%}".format(rec9))
 Precision of the model: 14.04%
 Recall of the model: 81.16%

9.9.6 The ROC Curve

In [162]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
roc9 = roc_auc_score(y_test, y_pred9)
fpr, tpr, thresholds = roc_curve(y_test, ad.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='AdaBoost (area = %0.4f)' % roc9)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

9.9.7 ROC AUC

In [163]:
print("\033[1m The ROC AUC score using the AdaBoost Classifier algorithm is:", "{:.4%}".format(roc9))
 The ROC AUC score using the AdaBoost Classifier algorithm is: 62.2657%

9.10 XGBoost

We will create a SyntheticGridSearch for the XGBoost in order to find the best hyperparameter of this algorithm

In [164]:
XG_roc = []
import re
regex = re.compile(r"\[|\]|<", re.IGNORECASE)
X_train.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_train.columns.values]
X_test.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_test.columns.values]
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
for i in range(1,21):
    xg = XGBClassifier(max_depth=i, random_state=47)
    xg.fit(X_train, y_train)
    y_pred1 = xg.predict(X_test)
    roc1 = roc_auc_score(y_test, y_pred1)
    print(i, roc1)
    XG_roc.append(roc1)
print(max(XG_roc))
1 0.5085165352033
2 0.5014920014897617
3 0.49949701343729247
4 0.5051055375876867
5 0.5128659931168913
6 0.5083815095484255
7 0.514357994606653
8 0.5193065888679986
9 0.5191715632131243
10 0.5183614092838776
11 0.5168018949666787
12 0.516412016387379
13 0.518309094841764
14 0.5190584552297165
15 0.5160069394227558
16 0.514822105299533
17 0.516074452250193
18 0.513876925715412
19 0.5179867290899016
20 0.5131646813842498
0.5193065888679986

We will refit an estimator using the best found parameter {'max_depth': 8}

9.10.1 Training a Predictive Model

In [165]:
import re
regex = re.compile(r"\[|\]|<", re.IGNORECASE)
X_train.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_train.columns.values]
X_test.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_test.columns.values]
from xgboost import XGBClassifier
xg = XGBClassifier(max_depth=8, random_state=47)
xg.fit(X_train, y_train)
Out[165]:
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=1, gamma=0, gpu_id=-1,
              importance_type='gain', interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=8,
              min_child_weight=1, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=0, num_parallel_tree=1,
              objective='binary:logistic', random_state=47, reg_alpha=0,
              reg_lambda=1, scale_pos_weight=1, subsample=1,
              tree_method='exact', validate_parameters=1, verbosity=None)

9.10.2 Evaluating the Model Accuracy

In [166]:
print('\n accuracy:')
print('---------------')
acc10=xg.score(X_test, y_test)
print(acc10)
 accuracy:
---------------
0.8861818181818182

9.10.3 Confusion Matrix

In [167]:
y_pred10 = xg.predict(X_test)
from sklearn.metrics import confusion_matrix
cm10 = confusion_matrix(y_true=y_test, y_pred=y_pred10)
cmDf10=pd.DataFrame(cm10, index=xg.classes_, columns=xg.classes_)
print('\nConfusion Matrix')
print('-------------------')
print(cmDf10)
Confusion Matrix
-------------------
      0    1
0  7262  144
1   795   49
In [168]:
print("\033[1m The result is telling us that we have: ",(cm10[0,0]+cm10[1,1]),"correct predictions.")
print("\033[1m The result is telling us that we have: ",(cm10[0,1]+cm10[1,0]),"incorrect predictions.")
print("\033[1m We have a total predictions of: ",(cm10.sum()))
 The result is telling us that we have:  7311 correct predictions.
 The result is telling us that we have:  939 incorrect predictions.
 We have a total predictions of:  8250

9.10.4 Classification Report

In [169]:
from sklearn.metrics import classification_report
print('\nclassification_report')
print('------------------------')
print(classification_report(y_test, y_pred10))
classification_report
------------------------
              precision    recall  f1-score   support

           0       0.90      0.98      0.94      7406
           1       0.25      0.06      0.09       844

    accuracy                           0.89      8250
   macro avg       0.58      0.52      0.52      8250
weighted avg       0.84      0.89      0.85      8250

9.10.5 Precision and Recall

In [170]:
per10=metrics.precision_score(y_test, y_pred10)
rec10=metrics.recall_score(y_test, y_pred10)
print("\033[1m Precision of the model:", "{:.2%}".format(per10))
print("\033[1m Recall of the model:", "{:.2%}".format(rec10))
 Precision of the model: 25.39%
 Recall of the model: 5.81%

9.10.6 The ROC Curve

In [171]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
roc10 = roc_auc_score(y_test, y_pred10)
fpr, tpr, thresholds = roc_curve(y_test, xg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='XGBoost (area = %0.4f)' % roc10)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

9.10.7 ROC AUC

In [172]:
print("\033[1m The ROC AUC score using the XGBoost Classifier algorithm is:", "{:.4%}".format(roc10))
 The ROC AUC score using the XGBoost Classifier algorithm is: 51.9307%

10. Comparing several machine learning classification models

10.1 Comparing models

In [173]:
models = ['K-Nearest Neighbors','Logistic Regression', 
          'Support Vector Machine Classifier', 'Decision Tree Classifier',
          'Random Forest Classifier', 'Naïve Bayes Classifier',
          'Extra Trees Classifier', 'Gradient Boosting Classifier', 
          'AdaBoost Classifier', 'XGBoost Boosting Classifier'] 
tests_roc = [roc1, roc2, roc3, roc4, roc5, roc6, roc7, roc8, roc9, roc10]
In [174]:
compare_models = pd.DataFrame({ "Algorithms": models, "ROC AUC": tests_roc})
compare_models.sort_values(by = "ROC AUC", ascending = False)
Out[174]:
Algorithms ROC AUC
3 Decision Tree Classifier 0.622657
8 AdaBoost Classifier 0.622657
5 Naïve Bayes Classifier 0.612895
1 Logistic Regression 0.609145
4 Random Forest Classifier 0.597883
6 Extra Trees Classifier 0.562444
2 Support Vector Machine Classifier 0.537129
7 Gradient Boosting Classifier 0.526397
9 XGBoost Boosting Classifier 0.519307
0 K-Nearest Neighbors 0.513165
In [175]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(8,8))
sns.barplot(x = "ROC AUC", y = "Algorithms", data = compare_models)
plt.show()

From the above 10 machine learning models, we choose the Decision Tree which gave us the best results.

10.2 Feature Importance in the chosen model

In [176]:
default_features = [x for i,x in enumerate(cols) if i!=30]

def plot_feature_importances_default(model):
    plt.figure(figsize=(10,10))
    n_features = len(cols)
    plt.barh(range(n_features), model.feature_importances_, align='center')
    plt.yticks(np.arange(n_features), default_features)
    plt.xlabel("Feature importance")
    plt.ylabel("Feature")
    plt.ylim(-1, n_features)
    
plot_feature_importances_default(dt)
plt.savefig('feature_importance')

11. Training & Evaluating different ensemble learning classification methods

11.1. Simple Averaging Approach

11.1.1 Training a Predictive Model

In [177]:
ET_clf = et
RF_clf = rf
DT_clf = dt

ET_clf.fit(X_train, y_train)
RF_clf.fit(X_train, y_train)
DT_clf.fit(X_train, y_train)
Out[177]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=1, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=47, splitter='best')

11.1.2 Evaluating the Model Accuracy

In [178]:
ET_pred = ET_clf.predict(X_test)
RF_pred = RF_clf.predict(X_test)
DT_pred = DT_clf.predict(X_test)
from sklearn.metrics import accuracy_score
averaged_preds = (ET_pred + RF_pred + DT_pred)//3
print('\n accuracy:')
print('---------------')
acc11=accuracy_score(y_test,averaged_preds)
print(acc11)
 accuracy:
---------------
0.822060606060606

11.1.3 Confusion Matrix

In [179]:
averaged_preds = (ET_pred + RF_pred + DT_pred)//3
from sklearn.metrics import confusion_matrix
cm11 = confusion_matrix(y_true=y_test, y_pred=averaged_preds)
cmDf11=pd.DataFrame(cm11, index=DT_clf.classes_, columns=DT_clf.classes_)
print('\nConfusion Matrix')
print('-------------------')
print(cmDf11)
Confusion Matrix
-------------------
      0    1
0  6577  829
1   639  205
In [180]:
print("\033[1m The result is telling us that we have: ",(cm11[0,0]+cm11[1,1]),"correct predictions.")
print("\033[1m The result is telling us that we have: ",(cm11[0,1]+cm11[1,0]),"incorrect predictions.")
print("\033[1m We have a total predictions of: ",(cm11.sum()))
 The result is telling us that we have:  6782 correct predictions.
 The result is telling us that we have:  1468 incorrect predictions.
 We have a total predictions of:  8250

11.1.4 Classification Report

In [181]:
from sklearn.metrics import classification_report
print('\nclassification_report')
print('------------------------')
print(classification_report(y_test, averaged_preds))
classification_report
------------------------
              precision    recall  f1-score   support

           0       0.91      0.89      0.90      7406
           1       0.20      0.24      0.22       844

    accuracy                           0.82      8250
   macro avg       0.55      0.57      0.56      8250
weighted avg       0.84      0.82      0.83      8250

11.1.5 Precision and Recall

In [182]:
per11=metrics.precision_score(y_test, averaged_preds)
rec11=metrics.recall_score(y_test, averaged_preds)
print("\033[1m Precision of the model:", "{:.2%}".format(per11))
print("\033[1m Recall of the model:", "{:.2%}".format(rec11))
 Precision of the model: 19.83%
 Recall of the model: 24.29%

11.1.6 The ROC Curve

In [183]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
roc11 = roc_auc_score(y_test, averaged_preds)

average_proba = (ET_clf.predict_proba(X_test)[:,1]+ RF_clf.predict_proba(X_test)[:,1]+ DT_clf.predict_proba(X_test)[:,1])//3

fpr, tpr, thresholds = roc_curve(y_test, average_proba)
plt.figure()
plt.plot(fpr, tpr, label='Simple Averaging Approach (area = %0.4f)' % roc11)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

11.1.6 ROC AUC

In [184]:
print("\033[1m The ROC AUC score using the Simple Averaging Approach is:", "{:.4%}".format(roc11))
 The ROC AUC score using the Simple Averaging Approach is: 56.5477%

11.2. Voting/Stacking Classification

11.2.1 Training a Predictive Model

In [185]:
voting_clf = VotingClassifier(estimators=[('ET', ET_clf), ('RF', RF_clf), ('DT', DT_clf)], voting='soft')
voting_clf.fit(X_train, y_train)
Out[185]:
VotingClassifier(estimators=[('ET',
                              ExtraTreesClassifier(bootstrap=False,
                                                   ccp_alpha=0.0,
                                                   class_weight=None,
                                                   criterion='gini',
                                                   max_depth=2,
                                                   max_features='auto',
                                                   max_leaf_nodes=None,
                                                   max_samples=None,
                                                   min_impurity_decrease=0.0,
                                                   min_impurity_split=None,
                                                   min_samples_leaf=1,
                                                   min_samples_split=2,
                                                   min_weight_fraction_leaf=0.0,
                                                   n_estimators=100,
                                                   n_jobs=None, oob_score=Fal...
                              DecisionTreeClassifier(ccp_alpha=0.0,
                                                     class_weight=None,
                                                     criterion='gini',
                                                     max_depth=1,
                                                     max_features=None,
                                                     max_leaf_nodes=None,
                                                     min_impurity_decrease=0.0,
                                                     min_impurity_split=None,
                                                     min_samples_leaf=1,
                                                     min_samples_split=2,
                                                     min_weight_fraction_leaf=0.0,
                                                     presort='deprecated',
                                                     random_state=47,
                                                     splitter='best'))],
                 flatten_transform=True, n_jobs=None, voting='soft',
                 weights=None)

11.2.2 Evaluating the Model Accuracy

In [186]:
print('\n accuracy:')
print('---------------')
acc12=voting_clf.score(X_test, y_test)
print(acc12)
 accuracy:
---------------
0.5410909090909091

11.2.3 Confusion Matrix

In [187]:
voting_preds = voting_clf.predict(X_test)
from sklearn.metrics import confusion_matrix
cm12 = confusion_matrix(y_true=y_test, y_pred=voting_preds)
cmDf12=pd.DataFrame(cm12, index=voting_clf.classes_, columns=voting_clf.classes_)
print('\nConfusion Matrix')
print('-------------------')
print(cmDf12)
Confusion Matrix
-------------------
      0     1
0  3861  3545
1   241   603
In [188]:
print("\033[1m The result is telling us that we have: ",(cm12[0,0]+cm12[1,1]),"correct predictions.")
print("\033[1m The result is telling us that we have: ",(cm12[0,1]+cm12[1,0]),"incorrect predictions.")
print("\033[1m We have a total predictions of: ",(cm12.sum()))
 The result is telling us that we have:  4464 correct predictions.
 The result is telling us that we have:  3786 incorrect predictions.
 We have a total predictions of:  8250

11.2.4 Classification Report

In [189]:
from sklearn.metrics import classification_report
print('\nclassification_report')
print('------------------------')
print(classification_report(y_test, voting_preds))
classification_report
------------------------
              precision    recall  f1-score   support

           0       0.94      0.52      0.67      7406
           1       0.15      0.71      0.24       844

    accuracy                           0.54      8250
   macro avg       0.54      0.62      0.46      8250
weighted avg       0.86      0.54      0.63      8250

11.2.5 Precision and Recall

In [190]:
per12=metrics.precision_score(y_test, voting_preds)
rec12=metrics.recall_score(y_test, voting_preds)
print("\033[1m Precision of the model:", "{:.2%}".format(per12))
print("\033[1m Recall of the model:", "{:.2%}".format(rec12))
 Precision of the model: 14.54%
 Recall of the model: 71.45%

11.2.6 The ROC Curve

In [191]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
roc12 = roc_auc_score(y_test, voting_preds)
fpr, tpr, thresholds = roc_curve(y_test, voting_clf.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Voting\Stacking Classification (area = %0.4f)' % roc12)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

11.2.7 ROC AUC

In [192]:
print("\033[1m The ROC AUC score using the Voting/Stacking Classification is:", "{:.4%}".format(roc12))
 The ROC AUC score using the Voting/Stacking Classification is: 61.7895%

11.3. Bagging Classification

We will create a SyntheticGridSearch for the Bagging Classification in order to find the best hyperparameter of this algorithm

In [193]:
Bagg_roc = []
from sklearn.ensemble import BaggingClassifier
from sklearn.metrics import roc_auc_score
num_estimators = [1, 10, 20, 40, 60, 80, 100]
for i in num_estimators:
    DT_Bagg = BaggingClassifier(base_estimator=DT_clf, n_estimators=i, random_state=47)
    RF_Bagg = BaggingClassifier(base_estimator=RF_clf, n_estimators=i, random_state=47)
    ET_Bagg = BaggingClassifier(base_estimator=ET_clf, n_estimators=i, random_state=47)
    DT_Bagg.fit(X_train, y_train)
    RF_Bagg.fit(X_train, y_train)
    ET_Bagg.fit(X_train, y_train)
    voting_Bagg = VotingClassifier(estimators=[('ET', ET_Bagg), ('RF', RF_Bagg), ('DT', DT_Bagg)], voting='soft')
    voting_Bagg.fit(X_train, y_train)
    y_pred1 = voting_Bagg.predict(X_test)
    roc1 = roc_auc_score(y_test, y_pred1)
    print(i, roc1)
    Bagg_roc.append(roc1)
print(max(Bagg_roc))
1 0.616684563431981
10 0.6190068767094185
20 0.619390036002575
40 0.6192988456906339
60 0.6188043382271068
80 0.6192921264044908
100 0.6190068767094185
0.619390036002575

We will refit an estimator using the best found parameter {'n_estimators': 20}

11.3.1 Training a Predictive Model

In [194]:
from sklearn.ensemble import BaggingClassifier
DT_Bagg = BaggingClassifier(base_estimator=DT_clf, n_estimators=20, random_state=47)
RF_Bagg = BaggingClassifier(base_estimator=RF_clf, n_estimators=20, random_state=47)
ET_Bagg = BaggingClassifier(base_estimator=ET_clf, n_estimators=20, random_state=47)

DT_Bagg.fit(X_train, y_train)
RF_Bagg.fit(X_train, y_train)
ET_Bagg.fit(X_train, y_train)


voting_Bagg = VotingClassifier(estimators=[('ET', ET_Bagg), ('RF', RF_Bagg), ('DT', DT_Bagg)], voting='soft')
voting_Bagg.fit(X_train, y_train)
Out[194]:
VotingClassifier(estimators=[('ET',
                              BaggingClassifier(base_estimator=ExtraTreesClassifier(bootstrap=False,
                                                                                    ccp_alpha=0.0,
                                                                                    class_weight=None,
                                                                                    criterion='gini',
                                                                                    max_depth=2,
                                                                                    max_features='auto',
                                                                                    max_leaf_nodes=None,
                                                                                    max_samples=None,
                                                                                    min_impurity_decrease=0.0,
                                                                                    min_impurity_split=None,
                                                                                    min_samples_leaf=1,
                                                                                    min_samples_split=2,
                                                                                    min_weight_fraction_leaf=0.0,
                                                                                    n_estimat...
                                                                                      min_impurity_split=None,
                                                                                      min_samples_leaf=1,
                                                                                      min_samples_split=2,
                                                                                      min_weight_fraction_leaf=0.0,
                                                                                      presort='deprecated',
                                                                                      random_state=47,
                                                                                      splitter='best'),
                                                bootstrap=True,
                                                bootstrap_features=False,
                                                max_features=1.0,
                                                max_samples=1.0,
                                                n_estimators=20, n_jobs=None,
                                                oob_score=False,
                                                random_state=47, verbose=0,
                                                warm_start=False))],
                 flatten_transform=True, n_jobs=None, voting='soft',
                 weights=None)

11.3.2 Evaluating the Model Accuracy

In [195]:
print('\n accuracy:')
print('---------------')
acc13=voting_Bagg.score(X_test, y_test)
print(acc13)
 accuracy:
---------------
0.556969696969697

11.3.3 Confusion Matrix

In [196]:
voting_preds1 = voting_Bagg.predict(X_test)
from sklearn.metrics import confusion_matrix
cm13 = confusion_matrix(y_true=y_test, y_pred=voting_preds1)
cmDf13=pd.DataFrame(cm13, index=voting_Bagg.classes_, columns=voting_Bagg.classes_)
print('\nConfusion Matrix')
print('-------------------')
print(cmDf13)
Confusion Matrix
-------------------
      0     1
0  4006  3400
1   255   589
In [197]:
print("\033[1m The result is telling us that we have: ",(cm13[0,0]+cm13[1,1]),"correct predictions.")
print("\033[1m The result is telling us that we have: ",(cm13[0,1]+cm13[1,0]),"incorrect predictions.")
print("\033[1m We have a total predictions of: ",(cm13.sum()))
 The result is telling us that we have:  4595 correct predictions.
 The result is telling us that we have:  3655 incorrect predictions.
 We have a total predictions of:  8250

11.3.4 Classification Report

In [198]:
from sklearn.metrics import classification_report
print('\nclassification_report')
print('------------------------')
print(classification_report(y_test, voting_preds1))
classification_report
------------------------
              precision    recall  f1-score   support

           0       0.94      0.54      0.69      7406
           1       0.15      0.70      0.24       844

    accuracy                           0.56      8250
   macro avg       0.54      0.62      0.47      8250
weighted avg       0.86      0.56      0.64      8250

11.3.5 Precision and Recall

In [199]:
per13=metrics.precision_score(y_test, voting_preds1)
rec13=metrics.recall_score(y_test, voting_preds1)
print("\033[1m Precision of the model:", "{:.2%}".format(per13))
print("\033[1m Recall of the model:", "{:.2%}".format(rec13))
 Precision of the model: 14.77%
 Recall of the model: 69.79%

11.3.6 The ROC Curve

In [200]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
roc13 = roc_auc_score(y_test, voting_preds1)
fpr, tpr, thresholds = roc_curve(y_test, voting_Bagg.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Bagging Classification (area = %0.4f)' % roc13)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

11.3.7 ROC AUC

In [201]:
print("\033[1m The ROC AUC score using the Bagging Classification is:", "{:.4%}".format(roc13))
 The ROC AUC score using the Bagging Classification is: 61.9390%

11.4. Boosting Classification

11.4.1 Training a Predictive Model

In [202]:
from sklearn.ensemble import AdaBoostClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import GradientBoostingClassifier
import re
regex = re.compile(r"\[|\]|<", re.IGNORECASE)
X_train.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_train.columns.values]
X_test.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_test.columns.values]
from sklearn.ensemble import BaggingClassifier
Ada_clf = ad
XG_clf = xg
Grad_clf = gb

Ada_clf.fit(X_train, y_train)
XG_clf.fit(X_train, y_train)
Grad_clf.fit(X_train, y_train)


voting_Boost = VotingClassifier(estimators=[('Grad', Grad_clf), ('XG', XG_clf), ('Ada', Ada_clf)], voting='soft')
voting_Boost.fit(X_train, y_train)
Out[202]:
VotingClassifier(estimators=[('Grad',
                              GradientBoostingClassifier(ccp_alpha=0.0,
                                                         criterion='friedman_mse',
                                                         init=None,
                                                         learning_rate=0.1,
                                                         loss='deviance',
                                                         max_depth=1,
                                                         max_features=None,
                                                         max_leaf_nodes=None,
                                                         min_impurity_decrease=0.0,
                                                         min_impurity_split=None,
                                                         min_samples_leaf=1,
                                                         min_samples_split=2,
                                                         min_weight_fraction_leaf=0.0,
                                                         n_estimators=100,
                                                         n_iter_no_change=N...
                                            num_parallel_tree=1,
                                            objective='binary:logistic',
                                            random_state=47, reg_alpha=0,
                                            reg_lambda=1, scale_pos_weight=1,
                                            subsample=1, tree_method='exact',
                                            validate_parameters=1,
                                            verbosity=None)),
                             ('Ada',
                              AdaBoostClassifier(algorithm='SAMME.R',
                                                 base_estimator=None,
                                                 learning_rate=1.0,
                                                 n_estimators=1,
                                                 random_state=47))],
                 flatten_transform=True, n_jobs=None, voting='soft',
                 weights=None)

11.4.2 Evaluating the Model Accuracy

In [203]:
print('\n accuracy:')
print('---------------')
acc14=voting_Boost.score(X_test, y_test)
print(acc14)
 accuracy:
---------------
0.8823030303030303

11.4.3 Confusion Matrix

In [204]:
voting_preds2 = voting_Boost.predict(X_test)
from sklearn.metrics import confusion_matrix
cm14 = confusion_matrix(y_test, voting_preds2)
cmDf14=pd.DataFrame(cm14, index=voting_Boost.classes_, columns=voting_Boost.classes_)
print('\nConfusion Matrix')
print('-------------------')
print(cmDf14)
Confusion Matrix
-------------------
      0    1
0  7208  198
1   773   71
In [205]:
print("\033[1m The result is telling us that we have: ",(cm14[0,0]+cm14[1,1]),"correct predictions.")
print("\033[1m The result is telling us that we have: ",(cm14[0,1]+cm14[1,0]),"incorrect predictions.")
print("\033[1m We have a total predictions of: ",(cm14.sum()))
 The result is telling us that we have:  7279 correct predictions.
 The result is telling us that we have:  971 incorrect predictions.
 We have a total predictions of:  8250

11.4.4 Classification Report

In [206]:
from sklearn.metrics import classification_report
print('\nclassification_report')
print('------------------------')
print(classification_report(y_test, voting_preds2))
classification_report
------------------------
              precision    recall  f1-score   support

           0       0.90      0.97      0.94      7406
           1       0.26      0.08      0.13       844

    accuracy                           0.88      8250
   macro avg       0.58      0.53      0.53      8250
weighted avg       0.84      0.88      0.85      8250

11.4.5 Precision and Recall

In [207]:
per14=metrics.precision_score(y_test, voting_preds2)
rec14=metrics.recall_score(y_test, voting_preds2)
print("\033[1m Precision of the model:", "{:.2%}".format(per14))
print("\033[1m Recall of the model:", "{:.2%}".format(rec14))
 Precision of the model: 26.39%
 Recall of the model: 8.41%

11.4.6 The ROC Curve

In [208]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
roc14 = roc_auc_score(y_test, voting_preds2)
fpr, tpr, thresholds = roc_curve(y_test, voting_Boost.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Boosting Classification (area = %0.4f)' % roc14)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

11.4.7 ROC AUC

In [209]:
print("\033[1m The ROC AUC score using the Boosting Classification is:", "{:.4%}".format(roc14))
 The ROC AUC score using the Boosting Classification is: 52.8694%

12. Comparing several ensemble learning classification methods

12.1 Comparing methods

In [210]:
methods = ['Simple Averaging Approach','Voting/Stacking Classification', 
          'Bagging Classification','Boosting Classification']
tests_roc = [roc11, roc12, roc13, roc14]
In [211]:
compare_models = pd.DataFrame({ "Methods": methods, "ROC AUC": tests_roc})
compare_models.sort_values(by = "ROC AUC", ascending = False)
Out[211]:
Methods ROC AUC
2 Bagging Classification 0.619390
1 Voting/Stacking Classification 0.617895
0 Simple Averaging Approach 0.565477
3 Boosting Classification 0.528694
In [212]:
import matplotlib.pyplot as plt
%matplotlib inline
plt.figure(figsize=(8,8))
sns.barplot(x = "ROC AUC", y = "Methods", data = compare_models)
plt.show()

From the above 4 ensemble learning methods, we choose the Bagging Classification which gave us the best results.

13. Training & Evaluating deep learning classification model

13.1. Multi-layer Perceptron

We will create a SyntheticGridSearch for the Multi-layer Perceptron in order to find the best hyperparameter of this algorithm

In [213]:
MLP_roc = [] 
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import roc_auc_score
max_iter= [1, 10, 20, 40, 60, 80, 100]
for i in max_iter:
    mlp = MLPClassifier(max_iter=i, random_state=47)
    mlp.fit(X_train, y_train)
    y_pred1 = mlp.predict(X_test)
    roc1 = roc_auc_score(y_test, y_pred1)
    print(i, roc1)
    MLP_roc.append(roc1)
print(max(MLP_roc))
1 0.5613682002424062
10 0.5029907222656664
20 0.5027984226955728
40 0.5708268753527626
60 0.5391107248765891
80 0.4999324871725628
100 0.578194572608606
0.578194572608606

We will refit an estimator using the best found parameter {'max_iter': 100}

13.1.1 Training a Predictive Model

In [214]:
from sklearn.neural_network import MLPClassifier
mlp = MLPClassifier(max_iter=100, random_state=47)
mlp.fit(X_train, y_train)
Out[214]:
MLPClassifier(activation='relu', alpha=0.0001, batch_size='auto', beta_1=0.9,
              beta_2=0.999, early_stopping=False, epsilon=1e-08,
              hidden_layer_sizes=(100,), learning_rate='constant',
              learning_rate_init=0.001, max_fun=15000, max_iter=100,
              momentum=0.9, n_iter_no_change=10, nesterovs_momentum=True,
              power_t=0.5, random_state=47, shuffle=True, solver='adam',
              tol=0.0001, validation_fraction=0.1, verbose=False,
              warm_start=False)

13.1.2 Evaluating the Model Accuracy

In [215]:
print('\n accuracy:')
print('---------------')
acc15=mlp.score(X_test, y_test)
print(acc15)
 accuracy:
---------------
0.7836363636363637

13.1.3 Confusion Matrix

In [216]:
y_pred15 = mlp.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm15 = confusion_matrix(y_test, y_pred15)
cmDf15=pd.DataFrame(cm15, index=mlp.classes_, columns=mlp.classes_)
print('\nConfusion Matrix')
print('-------------------')
print(cmDf15)
Confusion Matrix
-------------------
      0     1
0  6195  1211
1   574   270
In [217]:
print("\033[1m The result is telling us that we have: ",(cm15[0,0]+cm15[1,1]),"correct predictions.")
print("\033[1m The result is telling us that we have: ",(cm15[0,1]+cm15[1,0]),"incorrect predictions.")
print("\033[1m We have a total predictions of: ",(cm15.sum()))
 The result is telling us that we have:  6465 correct predictions.
 The result is telling us that we have:  1785 incorrect predictions.
 We have a total predictions of:  8250

13.1.4 Classification Report

In [218]:
from sklearn.metrics import classification_report
print('\nclassification_report')
print('------------------------')
print(classification_report(y_test, y_pred15))
classification_report
------------------------
              precision    recall  f1-score   support

           0       0.92      0.84      0.87      7406
           1       0.18      0.32      0.23       844

    accuracy                           0.78      8250
   macro avg       0.55      0.58      0.55      8250
weighted avg       0.84      0.78      0.81      8250

13.1.5 Precision and Recall

In [219]:
per15=metrics.precision_score(y_test, y_pred15)
rec15=metrics.recall_score(y_test, y_pred15)
print("\033[1m Precision of the model:", "{:.2%}".format(per15))
print("\033[1m Recall of the model:", "{:.2%}".format(rec15))
 Precision of the model: 18.23%
 Recall of the model: 31.99%

13.1.6 The ROC Curve

In [220]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
roc15 = roc_auc_score(y_test, y_pred15)
fpr, tpr, thresholds = roc_curve(y_test, mlp.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label='Multi-layer Perceptron (area = %0.4f)' % roc15)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

13.1.7 ROC AUC

In [221]:
print("\033[1m The ROC AUC score using the Multi-layer Perceptron algorithm is:", "{:.4%}".format(roc15))
 The ROC AUC score using the Multi-layer Perceptron algorithm is: 57.8195%

14. Choosing the best model

In [222]:
models = ['Decision Tree','Bagging Classification', 
          'Multi-layer Perceptron']
tests_roc = [roc4, roc13, roc15]
In [223]:
compare_models = pd.DataFrame({ "Algorithms": models, "ROC AUC": tests_roc})
compare_models.sort_values(by = "ROC AUC", ascending = False)
Out[223]:
Algorithms ROC AUC
0 Decision Tree 0.622657
1 Bagging Classification 0.619390
2 Multi-layer Perceptron 0.578195

From the above 3 models, we choose the Decision Tree which gave us the best results.

15. Predict+Evaluate on test set with chosen model

15.1.1 Decision Tree Model Fitting

In [224]:
dt.fit(X_train, y_train)
Out[224]:
DecisionTreeClassifier(ccp_alpha=0.0, class_weight=None, criterion='gini',
                       max_depth=1, max_features=None, max_leaf_nodes=None,
                       min_impurity_decrease=0.0, min_impurity_split=None,
                       min_samples_leaf=1, min_samples_split=2,
                       min_weight_fraction_leaf=0.0, presort='deprecated',
                       random_state=47, splitter='best')

15.1.2 Predicting the test set results and calculating the accuracy

In [225]:
print("\033[1m Accuracy of Decision Tree on test set:", "{:.4f}".format(dt.score(X_test, y_test)))
 Accuracy of Decision Tree on test set: 0.4724

15.1.3 Cross Validation

Cross validation attempts to avoid overfitting while still producing a prediction for each observation dataset. We are using 10-fold Cross-Validation to train our model.

In [226]:
from sklearn import model_selection
from sklearn.model_selection import cross_val_score
kfold = model_selection.KFold(n_splits=10, random_state=47)
modelCV = dt
scoring = 'accuracy'
results = model_selection.cross_val_score(modelCV, X_train, y_train, cv=kfold, scoring=scoring)
print("\033[1m 10-fold cross validation average accuracy:", "{:.4f}".format((results.mean())))
 10-fold cross validation average accuracy: 0.6528

15.1.4 Confusion Matrix

In [227]:
y_pred4 = dt.predict(X_test)
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm4 = confusion_matrix(y_test, y_pred4)
cmDf4=pd.DataFrame(cm4, index=dt.classes_, columns=dt.classes_)
print('\nConfusion Matrix')
print('-------------------')
print(cmDf4)
print('-------------------')
print("\033[1m The result is telling us that we have: ",(cm4[0,0]+cm4[1,1]),"correct predictions.")
print("\033[1m The result is telling us that we have: ",(cm4[0,1]+cm4[1,0]),"incorrect predictions.")
print("\033[1m We have a total predictions of: ",(cm4.sum()))
Confusion Matrix
-------------------
      0     1
0  3212  4194
1   159   685
-------------------
 The result is telling us that we have:  3897 correct predictions.
 The result is telling us that we have:  4353 incorrect predictions.
 We have a total predictions of:  8250

15.1.5 Classification Report

In [228]:
from sklearn.metrics import classification_report
print('\nclassification_report')
print('------------------------')
print(classification_report(y_test, y_pred4))
classification_report
------------------------
              precision    recall  f1-score   support

           0       0.95      0.43      0.60      7406
           1       0.14      0.81      0.24       844

    accuracy                           0.47      8250
   macro avg       0.55      0.62      0.42      8250
weighted avg       0.87      0.47      0.56      8250

15.1.6 Compute precision, recall, F-measure and support

To quote from Scikit Learn:

The precision is the ratio tp / (tp + fp) where tp is the number of true positives and fp the number of false positives. The precision is intuitively the ability of the classifier to not label a sample as positive if it is negative.

The recall is the ratio tp / (tp + fn) where tp is the number of true positives and fn the number of false negatives. The recall is intuitively the ability of the classifier to find all the positive samples.

The F-beta score can be interpreted as a weighted harmonic mean of the precision and recall, where an F-beta score reaches its best value at 1 and worst score at 0.

The F-beta score weights the recall more than the precision by a factor of beta. beta = 1.0 means recall and precision are equally important.

The support is the number of occurrences of each class in y_test.

In [229]:
#calculate Accuracy, how often is the classifier correct?
print("\nAccuracy of the Decision Tree algorithm:", "{:.2%}".format(metrics.accuracy_score(y_test, y_pred4)))
print("Accuracy: Well, we got a classification rate of", "{:.2%}".format(metrics.accuracy_score(y_test, y_pred4)))
#calculate Precision
print("\nPrecision of the Decision Tree algorithm:", "{:.2%}".format(metrics.precision_score(y_test, y_pred4)))
print("Precision: Precision is about being precise, i.e., how precise our model is. In other words, we can say, when a model makes a prediction, how often it is correct. In our prediction case, when our model predict a loan is about to default, that loan actually defaulted", "{:.2%}".format(metrics.precision_score(y_test, y_pred4)) ,"of the time.")
#calculate Recall
print("\nRecall of the Decision Tree algorithm:", "{:.2%}".format(metrics.recall_score(y_test, y_pred4)))
print("Recall: If there is a loan that defaulted present in the test set, our model can identify it", "{:.2%}".format(metrics.recall_score(y_test, y_pred4)) ,"of the time.")
Accuracy of the Decision Tree algorithm: 47.24%
Accuracy: Well, we got a classification rate of 47.24%

Precision of the Decision Tree algorithm: 14.04%
Precision: Precision is about being precise, i.e., how precise our model is. In other words, we can say, when a model makes a prediction, how often it is correct. In our prediction case, when our model predict a loan is about to default, that loan actually defaulted 14.04% of the time.

Recall of the Decision Tree algorithm: 81.16%
Recall: If there is a loan that defaulted present in the test set, our model can identify it 81.16% of the time.

15.1.7 The ROC Curve

In [230]:
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
dt_roc_auc = roc_auc_score(y_test, dt.predict(X_test))
fpr, tpr, thresholds = roc_curve(y_test, dt.predict_proba(X_test)[:,1])
plt.figure()
plt.plot(fpr, tpr, label= 'Decision Tree (area = %0.4f)' % dt_roc_auc)
plt.plot([0, 1], [0, 1],'r--')
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.05])
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver operating characteristic')
plt.legend(loc="lower right")
plt.savefig('Log_ROC')
plt.show()

16. Real time predictions

Now our model has been built, let me use it for real time predictions.

In [231]:
chosenModel = cols
chosenModel
Out[231]:
Index(['LOAN_AMOUNT', 'INTEREST_RATE', 'MONTHLY_PAYMENT', 'ANNUAL_INCOME',
       'DEBT_TO_INCOME', 'LOAN_TERM_ 36 months', 'LOAN_TERM_ 60 months',
       'LOAN_PURPOSE_car', 'LOAN_PURPOSE_credit_card',
       'LOAN_PURPOSE_debt_consolidation', 'LOAN_PURPOSE_educational',
       'LOAN_PURPOSE_home_improvement', 'LOAN_PURPOSE_house',
       'LOAN_PURPOSE_major_purchase', 'LOAN_PURPOSE_medical',
       'LOAN_PURPOSE_moving', 'LOAN_PURPOSE_other',
       'LOAN_PURPOSE_renewable_energy', 'LOAN_PURPOSE_small_business',
       'LOAN_PURPOSE_vacation', 'LOAN_PURPOSE_wedding',
       'EMPLOYMENT_LENGTH_1-2 Years', 'EMPLOYMENT_LENGTH_3-4 Years',
       'EMPLOYMENT_LENGTH_5-6 Years', 'EMPLOYMENT_LENGTH_7-8 Years',
       'EMPLOYMENT_LENGTH_9-10 Years', 'EMPLOYMENT_LENGTH_< 1 year',
       'EMPLOYMENT_LENGTH_>10 Years', 'HOUSING_no', 'HOUSING_yes'],
      dtype='object')
In [232]:
df_loan_dummies['Probability_to_Default'] = dt.predict_proba(df_loan_dummies[chosenModel])[:,1]
In [233]:
df_loan_dummies["LOAN_ID"] = df['LOAN_ID']
df_loan_dummies["TRUE"]=df_loan_dummies["DEFAULT"]
df_loan_dummies["PREDICTED"]=dt.predict(X)
df_loan_dummies[["LOAN_ID","TRUE","PREDICTED", "Probability_to_Default"]].head(10)
Out[233]:
LOAN_ID TRUE PREDICTED Probability_to_Default
0 263591 1 1 0.607246
1 1613916 0 1 0.607246
2 818934 0 1 0.607246
3 1606612 0 1 0.607246
4 1639932 0 0 0.234574
5 756884 1 1 0.607246
6 1251123 0 1 0.607246
7 15172 0 1 0.607246
8 1503361 0 1 0.607246
9 966958 0 0 0.234574
In [234]:
df_loan_dummies.to_csv('Prob_to_Default.csv', index=False, encoding='utf-8')

17. Deployment

In [235]:
a_1  = int(input("Please enter the loan amount without comma-separated (for example, 20000):"))
a_2  = float(input("Please enter the loan rate in percentage (for example, 17.93):"))
a_3  = float(input("Please enter the monthly payment (for example, 342.94):"))
a_4  = float(input("Please enter the annual income without comma-separated (for example, 344304):"))
a_5  = float(input("Please enter the debt to income ratio in in percentage (for example, 18.47):"))
a_6  = int(input("Does the loan term is 36 months (1 if yes, 0 otherwise)? (for example, 0):"))
a_7  = int(input("Does the loan term is 60 months (1 if yes, 0 otherwise)? (for example, 1):"))
a_8  = int(input("Does the loan purpose is for a car (1 if yes, 0 otherwise)? (for example, 0):"))
a_9  = int(input("Does the loan purpose is for a credit card (1 if yes, 0 otherwise)? (for example, 0):"))
a_10 = int(input("Does the loan purpose is for a debt consolidation (1 if yes, 0 otherwise)? (for example, 1):"))
a_11 = int(input("Does the loan purpose is for an education (1 if yes, 0 otherwise)? (for example, 0):"))
a_12 = int(input("Does the loan purpose is for a home improvement (1 if yes, 0 otherwise)? (for example, 0):"))
a_13 = int(input("Does the loan purpose is for a house (1 if yes, 0 otherwise)? (for example, 0):"))
a_14 = int(input("Does the loan purpose is for a major purchase (1 if yes, 0 otherwise)? (for example, 0):"))
a_15 = int(input("Does the loan purpose is for a medical treatment (1 if yes, 0 otherwise)? (for example, 0):"))
a_16 = int(input("Does the loan purpose is for moving (1 if yes, 0 otherwise)? (for example, 0):"))
a_17 = int(input("Does the loan purpose is for other purpose (1 if yes, 0 otherwise)? (for example, 0):"))
a_18 = int(input("Does the loan purpose is for a renewable_energy (1 if yes, 0 otherwise)? (for example, 0):"))
a_19 = int(input("Does the loan purpose is for a small_business (1 if yes, 0 otherwise)? (for example, 0):"))
a_20 = int(input("Does the loan purpose is for a vacation (1 if yes, 0 otherwise)? (for example, 0):"))
a_21 = int(input("Does the loan purpose is for a wedding (1 if yes, 0 otherwise)? (for example, 0):"))
a_22 = int(input("Does the loan applicant's employment length range between 1 to 2 years (1 if yes, 0 otherwise)? (for example, 1):"))
a_23 = int(input("Does the loan applicant's employment length range between 3 to 4 years (1 if yes, 0 otherwise)? (for example, 0):"))
a_24 = int(input("Does the loan applicant's employment length range between 5 to 6 years (1 if yes, 0 otherwise)? (for example, 0):"))
a_25 = int(input("Does the loan applicant's employment length range between 7 to 8 years (1 if yes, 0 otherwise)? (for example, 0):"))
a_26 = int(input("Does the loan applicant's employment length range between 9 to 10 years (1 if yes, 0 otherwise)? (for example, 0):"))
a_27 = int(input("Is the loan applicant's employment lower than 1 year (1 if yes, 0 otherwise)? (for example, 0):"))
a_28 = int(input("Is the loan applicant's employment higher than 10 years (1 if yes, 0 otherwise)? (for example, 0):"))
a_29 = int(input("Does the loan applicant rent a home (1 if yes, 0 otherwise)? (for example, 0):"))
a_30 = int(input("Does the loan applicant own a home (1 if yes, 0 otherwise)? (for example, 1):"))
new_data = np.array([a_1,a_2,a_3,a_4,a_5,a_6,a_7,a_8,a_9,a_10,a_11,a_12,a_13,a_14,a_15,
                     a_16,a_17,a_18,a_19,a_20,a_21,a_22,a_23,a_24,a_25,a_26,a_27,a_28,
                     a_29,a_30]).reshape(1,-1)
new_pred=dt.predict(new_data)
new_prob=dt.predict_proba(new_data)
int(new_pred[0])
if int(new_pred[0]) == 1:
    print("\033[1m \nThe new loan is predicted to default (Don't give this applicant any money!!!!)\033[1m")
    print("\033[1m \nThe default probability of this applicant is", "{:.4%}".format(max(new_prob[0])))
else:
    print("\033[1m \nThe new loan is not predicted to default (continue checking this applicant)\033[1m")
Please enter the loan amount without comma-separated (for example, 20000):20000
Please enter the loan rate in percentage (for example, 17.93):17.93
Please enter the monthly payment (for example, 342.94):342.94
Please enter the annual income without comma-separated (for example, 344304):344304
Please enter the debt to income ratio in in percentage (for example, 18.47):18.47
Does the loan term is 36 months (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan term is 60 months (1 if yes, 0 otherwise)? (for example, 1):1
Does the loan purpose is for a car (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan purpose is for a credit card (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan purpose is for a debt consolidation (1 if yes, 0 otherwise)? (for example, 1):1
Does the loan purpose is for an education (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan purpose is for a home improvement (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan purpose is for a house (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan purpose is for a major purchase (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan purpose is for a medical treatment (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan purpose is for moving (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan purpose is for other purpose (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan purpose is for a renewable_energy (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan purpose is for a small_business (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan purpose is for a vacation (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan purpose is for a wedding (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan applicant's employment length range between 1 to 2 years (1 if yes, 0 otherwise)? (for example, 1):1
Does the loan applicant's employment length range between 3 to 4 years (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan applicant's employment length range between 5 to 6 years (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan applicant's employment length range between 7 to 8 years (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan applicant's employment length range between 9 to 10 years (1 if yes, 0 otherwise)? (for example, 0):0
Is the loan applicant's employment lower than 1 year (1 if yes, 0 otherwise)? (for example, 0):0
Is the loan applicant's employment higher than 10 years (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan applicant rent a home (1 if yes, 0 otherwise)? (for example, 0):0
Does the loan applicant own a home (1 if yes, 0 otherwise)? (for example, 1):1
 
The new loan is predicted to default (Don't give this applicant any money!!!!)
 
The default probability of this applicant is 60.7246%
In [ ]: